MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
NGUYEN THUY BINH
PERSON RE-IDENTIFICATION
IN A SURVEILLANCE CAMERA NETWORK
Major: Electronics Engineering Code: 9520203
DECLARATION OF AUTHORSHIP
I, Nguyen Thuy Binh, declare that the thesis titled "Person re-identification in a surveillance camera network" has been entirely composed by myself. I assure the following points:
This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.
Appropriate acknowledgement has been given within this thesis where reference has been made to the published work of others.
The thesis submitted is my own, except where work done in collaboration has been included. The collaborative contributions have been clearly indicated.
Hanoi, 24/11/2020
Ph.D. Student
SUPERVISORS
ACKNOWLEDGEMENT
This dissertation was written during my doctoral course at the School of Electronics and Telecommunications (SET) and the International Research Institute of Multimedia, Information, Communication and Applications (MICA), Hanoi University of Science and Technology (HUST). I am grateful to all the people who have supported and encouraged me in completing this study.
First, I would like to express my sincere gratitude to my advisors, Assoc. Prof. Pham Ngoc Nam and Assoc. Prof. Le Thi Lan, for their effective guidance, their patience, continuous support and encouragement, and their immense knowledge.
I would like to express my gratitude to Dr. Vo Le Cuong and Dr. Ha Thi Thu Lan for their help. I would like to thank all members of the School of Electronics and Telecommunications, the International Research Institute of Multimedia, Information, Communications and Applications (MICA), Hanoi University of Science and Technology (HUST), as well as all of my colleagues in the Faculty of Electrical-Electronic Engineering, University of Transport and Communications (UTC). They have always helped me in the research process and given me helpful advice to overcome my own difficulties. Moreover, attending scientific conferences has always been a great experience for me, and I received many useful comments there.
During my PhD course, I have received much support from the Management Board of the School of Electronics and Telecommunications, the MICA Institute, and the Faculty of Electrical-Electronic Engineering. My sincere thanks go to Assoc. Prof. Nguyen Huu Thanh, Dr. Nguyen Viet Son and Assoc. Prof. Nguyen Thanh Hai, who gave me a lot of support and help. Without their precious support, it would have been impossible to conduct this research. Thanks to my employer, the University of Transport and Communications (UTC), for all necessary support and encouragement during my PhD journey. I am also grateful to Vietnam's Program 911 and to HUST and UTC projects for their generous financial support. Special thanks to my family and relatives, particularly my beloved husband and our children, for their never-ending support and sacrifice.
Hanoi, 2020
Ph.D. Student
CONTENTS
DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1 LITERATURE REVIEW
  1.1 Person ReID classifications
    1.1.1 Single-shot versus Multi-shot
    1.1.2 Closed-set versus Open-set person ReID
    1.1.3 Supervised and unsupervised person ReID
  1.2 Datasets and evaluation metrics
    1.2.1 Datasets
    1.2.2 Evaluation metrics
  1.3 Feature extraction
    1.3.1 Hand-designed features
    1.3.2 Deep-learned features
  1.4 Metric learning and person matching
    1.4.1 Metric learning
    1.4.2 Person matching
  1.5 Fusion schemes for person ReID
  1.6 Representative frame selection
  1.7 Fully automated person ReID systems
  1.8 Research on person ReID in Vietnam
CHAPTER 2 MULTI-SHOT PERSON RE-ID THROUGH REPRESENTATIVE FRAMES SELECTION AND TEMPORAL FEATURE POOLING
  2.1 Introduction
  2.2 Proposed method
    2.2.1 Overall framework
    2.2.2 Representative image selection
    2.2.3 Image-level feature extraction
    2.2.4 Temporal feature pooling
    2.2.5 Person matching
  2.3 Experimental results
    2.3.1 Evaluation of representative frame extraction and temporal feature pooling schemes
    2.3.2 Quantitative evaluation of the trade-off between the accuracy and computational time
    2.3.3 Comparison with state-of-the-art methods
  2.4 Conclusions and Future work
CHAPTER 3 PERSON RE-ID PERFORMANCE IMPROVEMENT BASED ON FUSION SCHEMES
  3.1 Introduction
  3.2 Fusion schemes for the first setting of person ReID
    3.2.1 Image-to-images person ReID
    3.2.2 Images-to-images person ReID
    3.2.3 Obtained results on the first setting
  3.3 Fusion schemes for the second setting of person ReID
    3.3.1 The proposed method
    3.3.2 Obtained results on the second setting
  3.4 Conclusions
CHAPTER 4 QUANTITATIVE EVALUATION OF AN END-TO-END PERSON REID PIPELINE
  4.1 Introduction
  4.2 An end-to-end person ReID pipeline
    4.2.1 Pedestrian detection
    4.2.2 Pedestrian tracking
    4.2.3 Person ReID
  4.3 GOG descriptor re-implementation
    4.3.1 Comparison the performance of two implementations
    4.3.2 Analyze the effect of GOG parameters
  4.4 Evaluation performance of an end-to-end person ReID pipeline
    4.4.1 The effect of human detection and segmentation on person ReID in single-shot scenario
    4.4.2 The effect of human detection and segmentation on person ReID in multi-shot scenario
  4.5 Conclusions and Future work
Bibliography
No Abbreviation Meaning
2 AIT Austrian Institute of Technology
3 AMOC Accumulative Motion Context
6 CIE The International Commission on Illumination
11 CVPDL Cross-view Projective Dictionary Learning
12 CVPR Conference on Computer Vision and Pattern Recognition
13 DDLM Discriminative Dictionary Learning Method
15 DeepSORT Deep learning Simple Online and Realtime Tracking
19 ECCV European Conference on Computer Vision
20 FAST 3D Fast Adaptive Spatio-Temporal 3D
23 FPNN Filter Pairing Neural Network
27 HUST Hanoi University of Science and Technology
28 IBP Indian Buffet Process
29 ICCV International Conference on Computer Vision
30 ICIP International Conference on Image Processing
31 IDE ID-Discriminative Embedding
32 iLIDS-VID Imagery Library for Intelligent Detection Systems
33 ILSVRC ImageNet Large Scale Visual Recognition Competition
35 KCF Kernelized Correlation Filter
37 KISSME Keep It Simple and Straightforward MEtric
39 KXQDA Kernel Cross-view Quadratic Discriminative Analysis
40 LADF Locally-Adaptive Decision Functions
43 LDFV Local Descriptors encoded by Fisher Vector
44 LMNN Large Margin Nearest Neighbor
45 LMNN-R Large Margin Nearest Neighbor with Rejection
46 LOMO LOcal Maximal Occurrence
48 LSTMC Long Short-Term Memory network with a Coupled gate
50 MAPR Multimedia Analysis and Pattern Recognition
51 Mask R-CNN Mask Region with CNN
54 MCML Maximally Collapsing Metric Learning
55 MGCAM Mask-Guided Contrastive Attention Model
57 MLAPG Metric Learning by Accelerated Proximal Gradient
62 MTMCT Multi-Target Multi-Camera Tracking
63 Person ReID Person Re-Identification
64 Pedparsing Pedestrian Parsing
66 PRW Person Re-identification in the Wild
67 QDA Quadratic Discriminative Analysis
68 RAiD Re-Identification Across indoor-outdoor Dataset
71 RHSP Recurrent High-Structured Patches
72 RKHS Reproducing Kernel Hilbert Space
75 SDALF Symmetry Driven Accumulation of Local Feature
76 SCNCD Salient Color Names based Color Descriptor
78 SIFT Scale-Invariant Feature Transform
79 SILTP Scale Invariant Local Ternary Pattern
82 SORT Simple Online and Realtime Tracking
85 TAPR Temporally Aligned Pooling Representation
86 TAUDL Tracklet Association Unsupervised Deep Learning
87 TCSVT Transactions on Circuits and Systems for Video Technology
88 TII Transactions on Industrial Informatics
89 TPAMI Transactions on Pattern Analysis and Machine Intelligence
91 Two-stream MR Two-stream Multirate Recurrent Neural Network
92 UIT University of Information Technology
93 UTAL Unsupervised Tracklet Association Learning
94 VIPeR View-point Invariant Pedestrian Recognition
95 VNU-HCM Vietnam National University - Ho Chi Minh City
97 WHOS Weighted Histograms of Overlapping Stripes
99 XQDA Cross-view Quadratic Discriminative Analysis
LIST OF TABLES
1.1 Benchmark datasets used in the thesis
2.1 The matching rates (%) when applying different pooling methods on different color spaces in case of using four key frames on PRID 2011 dataset. The two best results for each case are in bold.
2.2 The matching rates (%) when applying different pooling methods on different color spaces in case of using frames within a walking cycle on PRID 2011 dataset. The two best results for each case are in bold.
2.3 The matching rates (%) when applying different pooling methods on different color spaces in case of using all frames on PRID 2011 dataset. The two best results for each case are in bold.
2.4 The matching rates (%) when applying different pooling methods on different color spaces in case of using four key frames on iLIDS-VID dataset. The two best results for each case are in bold.
2.5 The matching rates (%) when applying different pooling methods on different color spaces in case of using frames within a walking cycle on iLIDS-VID dataset. The two best results for each case are in bold.
2.6 The matching rates (%) when applying different pooling methods on different color spaces in case of using all frames on iLIDS-VID dataset. The two best results for each case are in bold.
2.7 Matching rates (%) in several important ranks when using four key frames, four random frames, and one random frame in PRID-2011 and iLIDS-VID datasets
2.8 Comparison of the three representative frame selection schemes in term of accuracy at rank-1, computational time, and memory requirement on PRID 2011 dataset
2.9 Comparison between the proposed method and existing works on PRID 2011 and iLIDS-VID datasets. Two best results are in bold.
3.1 Matching rates (%) in case of images-to-images on CAVIAR4REID (case B)
3.2 Matching rates (%) in case of images-to-images person ReID on the RAiD dataset
3.3 Comparison the best matching rates at rank-1 in image-to-images case and those of in images-to-images one
3.4 Comparison of images-to-images and image-to-images schemes at rank-1. (*) means the obtained results by applying the proposed strategies over 10 random trials in case A of CAVIAR4REID
3.5 Comparison between the proposed method and existing works on PRID 2011 and iLIDS-VID datasets. Two best results are in bold.
4.1 Comparison of the proposed method with state of the art methods for PRID 2011 (the two best results are in bold)
LIST OF FIGURES
1 The ranked list of gallery person corresponding to the given query based on the similarities between the query and each of gallery ones
2 An example for challenges caused by variations in a) illumination b) view-point
3 A person has multiple images captured in different camera-views
4 A fully-automatic person ReID system consisting of three main stages: human detection, tracking and re-identification
1.1 Some important milestones for person ReID problem [8]. Several approaches related to this thesis are bounded by red blocks
1.2 An example for a) single-shot (image-based) and b) multi-shot person (video-based) ReID approaches
1.3 The differences between a) Closed-set and b) Open-set person ReID. In closed-set person ReID, an individual appears on at least two camera-views. Inversely, in open-set person ReID, a pedestrian might appear on only one camera-view
1.4 Two popular settings for person ReID problem: a) The testing persons have appeared in the training set (represented by the same colors) b) Persons in the training and testing sets are absolutely different
1.5 Camera layout for PRID-2011 dataset [33]
1.6 iLIDS-VID is captured by five non-overlapping cameras [36]
1.7 Some images of five datasets used for this thesis a) VIPeR b) CAVIAR4REID c) RAiD d) PRID-2011 e) iLIDS-VID. For the first three datasets (VIPeR, CAVIAR4REID, and RAiD), images in the same column belong to the same person while for the last two datasets, images in the same row represent for the same person
1.8 An example of CMC curves obtained with two methods: Method #1 and Method #2
1.9 Proposed framework using Siamese Convolutional Neural Network (SCNN) [54] a) Overall framework b) structure of a typical SCNN
1.10 Structure of a) an inception block b) a typical GoogLeNet [62]
1.11 Structure of a) an inception block b) ResNet-50 [64]
1.12 ResNet architecture [64]
1.13 Example for metric learning
1.14 Different strategies for a) early fusion b) late fusion
2.1 The proposed framework consists of four main steps: representative image selection, image-level feature extraction, temporal feature pooling and person matching
2.2 An example for a normal walking cycle of a pedestrian
2.3 An example for motion of pixels in the two subsequent frames
2.4 An example for computed Vx and Vy values on every frames in a given sequence of images. The blue and red dots present minimum and maximum values in Vx and Vy, respectively
2.5 Representative frame selection. The first row describes an image sequence of a person, the second row indicates the related original FEP (blue curve) and the regulated FEP (red curve). A walking cycle and four key frames extracted from this cycle are shown in the third and the last rows, respectively
2.6 An example for Gaussian filter with = 0
2.7 a) Random walking cycles of some person in PRID-2011 datasets and b) Four key frames in a walking cycle
2.8 (a) A person image is divided into patches and regions; (b) Pipeline for GOG feature extraction [49]
2.9 RGB color space
2.10 CIE L*a*b* color space [127]
2.11 HSV color space [129]
2.12 Three different feature pooling techniques for person representation
2.13 Distribution of I and and E in one projected dimension [45]
2.14 Evaluation the performance of GOG features on a) PRID 2011 dataset and b) iLIDS-VID dataset with three different representative frame selection scheme
2.15 Matching rates when selecting 4 key frames or 4 random frames for person representation in a) PRID-2011 and iLIDS-VID
2.16 The distribution of frames for each person in PRID 2011 dataset with a) camera A view and b) camera B view
3.1 Image-to-images person ReID scheme
3.2 Extracting KDES feature (best viewed in color)
3.3 An example for the effectiveness of GOG and ResNet features on different query persons
3.4 Proposed framework for images-to-images person ReID without temporal linking requirement
3.5 Evaluation the performance of three chosen features (GOG, KDES, CNN) over 10 trials on (a) CAVIAR4REID-case A (b) CAVIAR4REID-case B (c) RAiD datasets in image-to-images case
3.6 Comparison the performance of the three fusion schemes when using two or three features over 10 trials on (a) CAVIAR4REID-case A (b) CAVIAR4REID-case B (c) RAiD datasets in image-to-images case
3.7 CMC curves in case A of images-to-images person ReID on the CAVIAR4REID dataset
3.8 An example result of SvsM and MvsM scenarios. Each row in SvsM scenario are the first five ranked persons for each query image obtained by using image-to-images scheme on three features. Person in red box is the true matched person
3.9 The proposed method for video-based person ReID by combining the fusion scheme with metric learning technique
3.10 Matching rates with different fusion schemes on PRID-2011 dataset with a) four key frames b) frames within a walking cycle c) all frames
3.11 Matching rates with different fusion schemes on iLIDS-VID dataset a) four key frames b) frames within a walking cycle c) all frames
3.12 Average weights for GOG and ResNet features on a random split in a) PRID-2011 and b) iLIDS-VID
4.1 A fully person ReID pipeline including person detection, segmentation, tracking and person ReID steps
4.2 An example for automatic person detection and segmentation results on PRID 2011 dataset
4.3 An overview of ACF detector [109]
4.4 Fast feature pyramid in ACF detector [109]
4.5 a) An input image is divided in 7 × 7 grid cell b) The architecture of an YOLO detector [152]
4.6 The architecture of a) Faster R-CNN [150] and b) Mask R-CNN [111]
4.7 DDN architecture for Pedestrian Parsing [112]
4.8 a) ReID accuracy of the source code provided in [49] and that of the re-implementation and b) computation time (in s) for each step in extracting GOG feature on an image in C++
4.9 The matching rates at rank-1 with different number of regions (N)
4.10 CMC curves on VIPeR dataset when extracting GOG features with the optimal parameters
4.11 CMC curves of three evaluated scenarios on VIPeR dataset when applying the method proposed in Chapter 2
4.12 Examples for results of a) segmentation and b), c) person ReID in all three cases of using the original images, manually segmented images, automatically segmented images of two different persons in VIPeR dataset
4.13 CMC curves of three evaluated scenarios on PRID 2011 dataset in single-shot approach (a) Without segmentation and (b) with segmentation
4.14 CMC curves of three evaluated scenarios on PRID 2011 dataset when applying the proposed method in Chapter 2
4.15 Examples for results of a) human detection and segmentation and b), c) person ReID in all three cases of using the original images, manually segmented images, automatically segmented images of two different persons in PRID-2011 dataset
INTRODUCTION
Motivation
Person ReID is defined as associating cross-view images of the same person when he/she moves in a non-overlapping camera network [1]. In recent years, along with the development of surveillance camera systems, person re-identification (ReID) has increasingly attracted the attention of the computer vision and pattern recognition communities because of its promising applications in many areas, such as public safety and security, human-robot interaction, and person retrieval. In its early years, person ReID was considered a sub-task of Multi-Camera Tracking (MCT) [2]. The purpose of MCT is to generate tracklets in every single field of view (FoV) and then associate the tracklets that belong to the same pedestrian in different FoVs. In 2006, Gheissari et al. [3] first considered person ReID as an independent task. In some respects, person ReID and Multi-Target Multi-Camera Tracking (MTMCT) are close to each other. However, the two problems are fundamentally different in terms of objective and evaluation metrics. While the objective of MTMCT is to determine the position of each pedestrian over time from video streams taken by different cameras, person ReID tries to answer the question: "Which gallery images belong to a certain probe person?" and returns a list of the gallery persons sorted in descending order of their similarity to the given query person.
Whereas MTMCT classifies a pair of images as co-identical or not, person ReID ranks the gallery persons with respect to the given query person. Therefore, their performance is evaluated by different metrics: classification error rates for MTMCT and ranking performance for ReID. It is worth noting that in the case of an overlapping camera network, the corresponding images of the same person can be found through data association; this can be considered a person tracking problem, which is out of the scope of this thesis. In the last decade, with unremitting efforts, person ReID has achieved numerous important milestones with many great results [4, 5, 6, 7, 8]. However, it is still a challenging task and confronts various difficulties. These difficulties and challenges are presented in a later section. First of all, the mathematical formulation of person ReID is given as follows.
Problem formulation
In person ReID, the dataset is divided into two sets: probe and gallery. Note that the probe and gallery sets are captured in at least two non-overlapping camera fields of view. Given a query person $Q_i$ and $N$ persons in the gallery $G_j$, where $j = 1, \ldots, N$, each of $Q_i$ and $G_j$ is represented by the set of images available for that person.
Depending on the number of images used for person representation, person ReID can be categorized into single-shot, where one sole image is used, or multi-shot, where several images are available.
The identity of the given query person $Q_i$ is determined as follows [9]:
$$ j^{*} = \arg\min_{j} d(Q_i, G_j), \tag{0.2} $$
where $d(Q_i, G_j)$ is defined as the distance between the given query person $Q_i$ and a gallery person $G_j$. This distance can be calculated directly or learned through a metric learning method. It is worth noting that in another definition, the similarity between two pedestrians is used instead of the distance between them. In this case, the identity of the given query person $Q_i$ is defined as follows:
$$ j^{*} = \arg\max_{j} \mathrm{Sim}(Q_i, G_j). \tag{0.3} $$
The returned result of person ReID is the gallery person who has the smallest distance (or the largest similarity) to the given query person. However, in order to evaluate the performance of a person ReID method, a ranked list of the gallery persons is provided. This list is ranked in ascending order of distance (or descending order of similarity) to the given query person. Figure 1 shows an example of the ranked list of gallery persons corresponding to the given query, based on the similarities between the given query and each of the gallery ones.
Figure 1: The ranked list of gallery person corresponding to the given query based on the similarities between the query and each of gallery ones.
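The ranking rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the thesis: the Euclidean distance, the "smallest pairwise distance" rule for comparing two image sets, and the names `euclidean`, `set_distance`, `rank_gallery` and `rank_k_rate` are all hypothetical choices made for the example, and the feature vectors are toy values.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def set_distance(query_feats, gallery_feats):
    """Assumed set-to-set distance: the smallest pairwise image distance."""
    return min(euclidean(q, g) for q in query_feats for g in gallery_feats)


def rank_gallery(query_feats, gallery):
    """Return gallery ids sorted by ascending distance to the query."""
    dists = {gid: set_distance(query_feats, feats) for gid, feats in gallery.items()}
    return sorted(dists, key=dists.get)


def rank_k_rate(ranked_lists, true_ids, k):
    """Fraction of queries whose true identity appears in the top-k results."""
    hits = sum(1 for ranked, true_id in zip(ranked_lists, true_ids)
               if true_id in ranked[:k])
    return hits / len(true_ids)


# Toy gallery: each person is a list of image-level feature vectors.
gallery = {
    "G1": [[0.0, 1.0]],
    "G2": [[1.0, 0.0], [0.8, 0.2]],
    "G3": [[0.5, 0.5]],
}
queries = [[[0.9, 0.1]], [[0.2, 0.8]]]  # two query persons, one image each
true_ids = ["G2", "G3"]                 # hypothetical ground-truth identities

ranked_lists = [rank_gallery(q, gallery) for q in queries]
best_match = ranked_lists[0][0]         # arg-min of the distance for query 1
rank1 = rank_k_rate(ranked_lists, true_ids, k=1)
rank2 = rank_k_rate(ranked_lists, true_ids, k=2)
```

The first element of each ranked list is the arg-min identity, while the whole list is what a rank-k matching rate is computed from. Replacing `set_distance` with a similarity function and sorting in descending order would give the similarity-based variant of the same rule.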
Firstly, strong variations in illumination, view-point, and pose are general difficulties in any image processing problem. These factors can make the appearance discrepancies of the same person even larger than those between different persons. Consequently, one of the crucial tasks in person ReID is to build a discriminative visual descriptor for person representation. Such a descriptor should highlight the characteristics of each individual and help to distinguish between different persons more easily. Figure 2 illustrates the variations in illumination and view-point. It shows that the color of a pedestrian's clothes changes significantly due to these variations. Pairs of images in the same column present the same person and are captured in two different camera views.
Figure 2: An example for challenges caused by variations in a) illumination b) view-point.
The second challenge is the large number of images for each person in a camera view and the number of persons in the examined datasets. The number of identities as well as images in some evaluated datasets has grown rapidly in recent years. The early datasets have only hundreds of identities and thousands of images, whereas the latest datasets contain thousands of identities and millions of images. This places a significant burden on memory capacity, execution speed and computational complexity when solving the person ReID problem. Figure 3 shows some images of the same person captured in different camera views.
Besides, the number of images for each person in the existing datasets varies greatly. For example, in the PRID-2011 dataset, some persons have only 20 images, while others may have hundreds of images. This leads to an imbalance in person representation and also causes difficulties for person ReID.
Figure 3: A person has multiple images captured in different camera-views.
The third challenge is the effect of human detection and tracking results. In a fully-automatic surveillance system, person ReID is the last stage, whose inputs are the outcomes of the human detection and tracking stages, as illustrated in Fig. 4. The performance of the two previous stages greatly affects the overall performance. Most existing studies deal with human regions of interest (ROIs) that are manually detected and segmented with well-aligned bounding boxes. Nevertheless, in an automatic surveillance system, many problems and errors appear in human detection and tracking, such as false detections, ID switches, fragments, etc. Consequently, these errors might cause a reduction in ReID accuracy. Though the latest methodology-driven methods surpass human-level performance on several commonly used benchmark datasets, improving accuracy for application-driven ReID is still a non-trivial task.
Figure 4: A fully-automatic person ReID system consisting of three main stages: human detection, tracking and re-identification.
Based on the above analysis, person ReID is undoubtedly an interesting but challenging task. Person ReID has a wide range of applications, although it has to cope with many difficulties arising not only from realistic environments but also from hardware requirements. This motivates us to examine person ReID from different aspects. This research focuses on both methodology-driven and application-driven person ReID trends in order to provide a comprehensive understanding of this problem. The following section discusses the research objectives, context, constraints and datasets in more detail.
Objective
The objectives of this research are as follows:
Robust person representation for multi-shot person ReID. The first objective of this dissertation is to find a novel method that reduces computational cost and memory requirements while retaining accuracy in the video-based person ReID approach. In video-based person ReID, computational cost and memory requirements are the two main issues. Working with a large number of images is a burden for any surveillance system, since it results in a high computational cost as well as a large memory capacity requirement.
Improving accuracy through fusion schemes. Improving accuracy is the most important target of person ReID. In order to achieve this target, some existing works have concentrated on building an effective descriptor for person representation. However, each kind of feature has its own influence on a given image: a feature might be effective for one image but not for another. Besides, the complexity and diversity of current datasets keep increasing. Therefore, the second objective of this thesis is to improve person ReID accuracy through feature fusion schemes that take advantage of all kinds of features.
Evaluating the overall performance of an end-to-end person ReID pipeline. As discussed above, a practical person ReID system has three main stages: human detection, tracking, and person ReID. Most existing person ReID studies focus only on the person ReID stage and do not address the first two stages. With a view to providing person ReID results to a surveillance system, the final objective is to evaluate the overall performance of an end-to-end person ReID pipeline.
Context, constraints and datasets
Context and Constraints
There are several approaches to solving the person ReID problem in different contexts. In this thesis, the author adopts the context and constraints listed below.
Images/videos are captured in daylight conditions.
Regions of interest (RoIs), also called bounding boxes, are generated manually or automatically through human detection. These bounding boxes may be discrete or consecutive, and may or may not have temporal constraints.
The thesis focuses on short-term person ReID, in which the appearance and clothes of each pedestrian do not change during a certain period of time. In addition, pedestrians do not wear uniforms.
Person ReID is solved in the closed-set approach, in which each person has to appear in at least two cameras.
Contributions
Two main contributions are introduced in this dissertation.
Contribution 1: In multi-shot person ReID, a huge number of images can be captured for a person. Using all these images for person representation and matching may require a large memory capacity and long computation time. Moreover, we observe that walking cycles repeat while a pedestrian walks within a camera's field of view. This suggests extracting several representative frames for person representation instead of using all images. Therefore, this thesis proposes an effective method for multi-shot person ReID consisting of four main steps: representative frame selection, image feature extraction, feature pooling and person matching. Two types of representative frames are considered in this work: frames within a walking cycle and four key frames of a walking cycle.
Contribution 2: Some previous studies have shown that each feature has its own discriminative power for person representation. Therefore, in order to leverage the considered features, different late fusion schemes are proposed for both settings of multi-shot person ReID. In the first setting, person ReID is treated as an information retrieval problem, in which the identity of the given query person is determined through the probability of his/her images belonging to each of the trained appearance models. In the second setting, a combination of metric learning and a fusion scheme is proposed to improve person ReID accuracy. Instead of using equal weights, feature weights are adaptively determined for each query based on the query characteristics. A larger weight is assigned to a feature that is more effective for the given query image.
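Two of the ideas named in the contributions, pooling frame-level features into one vector and fusing per-feature scores with query-specific weights, can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the pooling modes (average, max, min), the weighted-sum fusion rule, the function names and the toy numbers are hypothetical, and the sketch takes the weights as given rather than showing how the thesis derives them from the query.

```python
def pool_features(frame_features, mode="avg"):
    """Collapse a list of per-frame feature vectors into a single vector."""
    columns = list(zip(*frame_features))  # one tuple per feature dimension
    if mode == "avg":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    if mode == "min":
        return [min(c) for c in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


def fuse_scores(scores, weights):
    """Weighted sum of per-feature similarity scores (late fusion).

    scores:  {feature_name: {gallery_id: similarity}}
    weights: {feature_name: non-negative weight}; normalized to sum to 1.
    """
    total = sum(weights.values())
    gallery_ids = next(iter(scores.values())).keys()
    return {
        gid: sum(weights[f] * scores[f][gid] for f in scores) / total
        for gid in gallery_ids
    }


def rank_by_similarity(fused):
    """Gallery ids in descending order of fused similarity."""
    return sorted(fused, key=fused.get, reverse=True)


# Toy data: two frames of a 2-D feature, pooled in two different ways.
frames = [[1.0, 4.0], [3.0, 2.0]]
pooled_avg = pool_features(frames, "avg")
pooled_max = pool_features(frames, "max")

# Hypothetical similarity scores from two feature types for one query.
scores = {
    "GOG": {"G1": 0.2, "G2": 0.9},
    "ResNet": {"G1": 0.8, "G2": 0.4},
}
equal = rank_by_similarity(fuse_scores(scores, {"GOG": 1.0, "ResNet": 1.0}))
adaptive = rank_by_similarity(fuse_scores(scores, {"GOG": 0.1, "ResNet": 0.9}))
```

In this toy case the top-ranked gallery person changes when the weights move from equal to query-specific values, which is the effect an adaptive weighting scheme is meant to exploit.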
Dissertation outline
In addition to the introduction and conclusion, the dissertation consists of four chapters and is structured as follows.
Introduction: This chapter provides the main motivations and objectives of the thesis as well as the contributions, constraints, and challenges of the research.
Chapter 1, entitled "Literature Review": This chapter reviews and synthesizes the existing literature in order to obtain a comprehensive understanding of previous research related to person ReID.
Chapter 2, entitled "Multi-shot person ReID through representative frames selection and temporal feature pooling": This chapter presents an effective framework for person ReID. This framework helps overcome the difficulties of video-based person ReID.
Chapter 3, entitled "Person ReID performance improvement based on fusion schemes": This chapter introduces several fusion schemes for person ReID. Different kinds of features are integrated into both early and late fusion schemes. The proposed fusion schemes are evaluated in both settings of person ReID.
Chapter 4, entitled "Quantitative evaluation of an end-to-end person ReID pipeline": This chapter focuses on person ReID performance when considering the effect of the human detection and tracking steps. From this, the chapter aims to answer the question: how can the influence of the first two steps on person ReID be overcome with state-of-the-art person ReID methods?
Conclusion and future work: This section summarizes the contributions of this thesis and introduces some future work on the person ReID problem.
Chapter 1 LITERATURE REVIEW
A broad view of the person ReID problem is provided by the timeline drawn by Leng et al. [8] in Figure 1.1. As shown in this figure, several remarkable concepts for person ReID are multi-shot, metric learning, open-set ReID, end-to-end ReID, and so on. Depending on the number of images of a person, the availability of a training set, or the appearance of the query person in the gallery set, person ReID techniques are categorized into single-shot versus multi-shot, closed-set versus open-set, and supervised versus unsupervised learning. Next, the taxonomy used in person ReID is described.
Figure 1.1: Some important milestones for person ReID problem [8]. Several approaches related to this thesis are bounded by red blocks.
1.1 Person ReID classifications
1.1.1 Single-shot versus Multi-shot
The difference between single-shot (image-based) and multi-shot (video-based) approaches is illustrated in Fig. 1.2. According to the aforementioned definition of person ReID, in single-shot scenarios, each query person and each person in the gallery set has exactly one image, while in multi-shot scenarios, the values of ni and mj are greater than one. Early person ReID studies focused only on the single-shot approach, in which person matching relies mainly on the comparison between two images (one in the probe set and another in the gallery set) [10, 11, 12, 13]. By contrast, in multi-shot person ReID, each pedestrian is described by multiple images or sequences. The first studies on multi-shot person ReID were reported in 2010 [14, 15]. On the one hand, although the single-shot scenario seems to be far from realistic applications, the results obtained for this case are a crucial building block for multi-shot person ReID. The computational cost and memory storage requirements of the single-shot problem are much lower than those of multi-shot person ReID. On the other hand, multi-shot person ReID can provide richer and more useful information to improve ReID accuracy. However, multi-shot person ReID has its own issues, such as memory storage requirements and computation time. To solve this problem, several studies have introduced solutions for extracting key frames that contain sufficient information to describe a pedestrian. This approach not only helps to remove redundant information but also increases calculation speed and saves memory capacity. These key frames are extracted based on clustering algorithms or on the distribution of motion energy.
1.1.2 Closed-set versus Open-set person ReID
In a broad view, person ReID can be formulated as either a closed-set or an open-set problem. The majority of existing person ReID works belong to the former, with the assumption that each individual appears in both the probe and gallery sets. This task can be understood as a matching problem that seeks the occurrences of a query person (probe) in a set of person candidates (gallery), and is called closed-set person ReID. However, the hypothesis that images of the same person are captured by different cameras is not always satisfied, which limits its applicability in practice. Therefore, open-set person ReID has become an inevitable trend and has received more and more attention from researchers all over the world. Open-set person ReID aims to answer the question: "Does a query person appear in the gallery set?" In this case, the query person might or might not appear in the gallery set, and open-set person ReID is turned into person verification [16, 17, 18, 19, 20]. This approach can be used in suspect retrieval if multiple images of a suspect or a criminal are available. Figure 1.3 shows an example of closed-set and open-set person ReID. In Figure 1.3a) the person appears on both cameras, while in Figure 1.3b) she appears only on Camera-A.
Figure 1.3: The differences between a) closed-set and b) open-set person ReID. In closed-set person ReID, an individual appears on at least two camera views. Conversely, in open-set person ReID, a pedestrian might appear on only one camera view.
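The difference between the two formulations can be illustrated with a small decision rule. This is a minimal sketch, not a method from the cited works: the function names, the distance values, and the rejection threshold are assumptions introduced only for illustration.

```python
import numpy as np

def closed_set_match(distances):
    """Closed-set: the query is assumed to be in the gallery, so the
    nearest gallery identity is always returned."""
    return int(np.argmin(distances))

def open_set_match(distances, threshold):
    """Open-set: return the nearest gallery identity only if it is close
    enough; otherwise report that the query is absent (None)."""
    best = int(np.argmin(distances))
    return best if distances[best] <= threshold else None

distances = np.array([0.82, 0.91, 0.77])   # made-up query-to-gallery distances
closed = closed_set_match(distances)        # always picks a gallery person
opened = open_set_match(distances, 0.5)     # rejects: no candidate is close enough
```

The closed-set rule always returns a gallery identity, whereas the open-set rule first verifies whether the best candidate is plausible at all, which corresponds to the verification question stated above.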
1.1.3 Supervised and unsupervised person ReID
Based on the availability of matched pairs in the training phase, person ReID can be divided into supervised and unsupervised approaches. In general, training is the crucial phase in pattern recognition problems, as it helps improve the performance of the recognition process. Accordingly, pedestrian models on cross-view cameras are learned from a previously collected dataset, and the matching pairs are labelled for the training phase. This requirement not only creates a burden of human labor but also limits the scalability of person ReID. Moreover, the data labeling process becomes even more difficult, or infeasible, when dealing with a large-scale dataset. In order to overcome this difficulty, some recent studies have followed the unsupervised approach, which takes unlabelled data into consideration [21, 22, 23, 24, 25]. With this approach, person ReID moves closer to a realistic context in which thousands of people appear in a camera view every day in public spaces such as airports, railway stations, and supermarkets. In this case, labeling is a self-taught process, without supervision. On the one hand, unsupervised methods help reduce human labor and move toward realistic systems. On the other hand, the matching rates of these methods are often much lower than those of supervised ones. This is because, without manually labelled matching pairs across camera views, existing unsupervised models cannot learn the appearance transformation of the same person between different camera views.
As this research focuses on the supervised person ReID approach, two settings, in which the model of a given pedestrian either is or is not trained in advance, are discussed further. In the former setting, person ReID is turned into person search [26, 27, 28]: the identity of a person of interest is determined based on the similarity between this person and each of the trained models. This implies that the identities of the pedestrians in the testing set are the same as those in the training set. This setting is suitable for, and employed in, several real situations, such as human management and monitoring, suspect/criminal search, etc. Obviously, when a pedestrian’s model is learned in advance, person ReID performance is greatly improved. By contrast, in the latter setting, the model of each individual in the testing phase is not learned in advance. Specifically, the identities of the pedestrians in the test phase are different from those in the training phase. This assumption is closer to a realistic context, and the majority of existing person ReID studies adopt this second evaluation setting. Figure 1.4 shows the differences between the two evaluation settings. Fig. 1.4a) uses the same colors for the query and gallery persons to indicate that the given query exists in the gallery set, which means that the models of the query persons are learned in advance. Fig. 1.4b) uses different colors for the query persons and the gallery persons to indicate that the appearance models of the query persons are not known in advance.
In the following, some prominent studies on person ReID are analyzed to give a better understanding of the existing approaches to this problem. As in any pattern recognition problem, feature extraction and metric learning are the two indispensable components of person ReID. Most existing person ReID works have tried to exploit the effectiveness of one or both of them in order to improve ReID performance. The roles of these two steps are discussed in more detail in Sections 1.3 and 1.4. Section 1.5 introduces several outstanding studies with different fusion strategies that achieve higher person ReID accuracy. Section 1.6 discusses representative frame selection and feature pooling, the two key issues in video-based person ReID systems. Finally, end-to-end person ReID systems are presented in Section 1.7. In particular, not only ReID accuracy but also efficiency (time consumption) and memory requirements are considered.
1.2 Datasets and evaluation metrics
1.2.1 Datasets
Figure 1.4: Two popular settings for person ReID problem: a) The testing persons have appeared in the training set (represented by the same colors). b) Persons in the training and testing sets are completely different.
In the literature, numerous datasets are used for person ReID evaluation. Some datasets are set up for either the single-shot or the multi-shot approach, while others are set up for both scenarios. In the single-shot approach, each person has a single image in each of the probe and gallery sets. Meanwhile, in the multi-shot approach, each person is represented by multiple images in both the probe and gallery sets. Additionally, two settings for person ReID are considered in this thesis. In the first setting, a person in the testing set has appeared in the training set. Conversely, in the second setting, the persons in the testing and training sets are completely different. These concepts are described in more detail in Chapter 1. The five benchmark datasets used for performance evaluation of the proposed methods in this thesis are described as follows.
Viewpoint Invariant Pedestrian Recognition (VIPeR) [29]. This is one of the most challenging datasets, with strong variations in pose, illumination, viewpoint, and occlusion. The dataset contains 1,264 images of 632 persons, with an image resolution of 128 × 48. Each person has a pair of images captured by two different non-overlapping cameras. This dataset is used for the single-shot case.
CAVIAR4REID [30]. This dataset contains 1,220 images of 72 pedestrians captured by two non-overlapping cameras in a shopping mall. However, only 50 persons appear in both cameras. This dataset was generated to maximize the variation in illumination conditions, occlusions, image resolutions, and viewpoints. The image resolutions in this dataset vary strongly, from 17 × 32 to 72 × 144, which also causes difficulty for person ReID.
RAiD [31]. This dataset comprises 6,920 images of 43 individuals captured by four cameras (two indoor, two outdoor). Only 41 of the 43 persons appear in all cameras. All images in this dataset are normalized to the same size of 64 × 128. The large illumination variations caused by collecting images in different scenarios are one of the challenges of this dataset. In this thesis, the gallery set is generated by randomly selecting 5 images for each person, and the remaining images are used for the probe set.
PRID-2011 (Person ReID) [32]. This dataset was created at the Austrian Institute of Technology (AIT) for experiments on person ReID. Images in this dataset are extracted from multiple pedestrian trajectories captured by two static non-overlapping cameras. These images suffer from large variations in illumination, viewpoint, pose, etc. Figure 1.5 shows the camera layout for the PRID-2011 dataset: two cameras are installed on two sides of a building at AIT with different viewpoints. The original videos contain 475 pedestrians in one view and 856 in the other, of which 245 persons appear in both views. Some images are filtered out due to strong occlusions, sudden disappearance/appearance, or because fewer than five reliable images are available for a person in a camera view. After filtering, there are 385 persons in camera view A and 749 persons in camera view B. The first 200 persons appear in both views and are used in person ReID experiments. It is worth noting that this thesis follows the experimental setting in [34], in which only the 178 persons having more than 21 images in an image sequence are chosen for evaluation. The data is divided into two halves, one for training and one for testing. This random division is repeated 10 times, and the reported result is the average over these 10 splits.
Figure 1.5: Camera layout for PRID-2011 dataset [33].
iLIDS-VID (Imagery Library for Intelligent Detection Systems) [35]. This dataset was recorded at an airport arrival hall under a multi-camera CCTV network. It consists of 600 image sequences of 300 pedestrians. The length of each sequence varies from 23 to 192 images, with an average of 73. In practice, this dataset is captured by five non-overlapping cameras, as shown in Figure 1.6, which depicts the five real scenes of the camera network. Compared with the PRID-2011 dataset, iLIDS-VID is considered more challenging due to the similarity of clothing, the strong variations in illumination and viewpoint, cluttered backgrounds, and occlusions. In person ReID evaluation, this dataset is also randomly split into two halves, one for training and the other for testing. This process is performed 10 times to achieve a fair comparison.
Among the above five datasets, CAVIAR4REID and RAiD are set up following the first setting, while the three remaining datasets follow the second setting, in which one half is used for the training phase and the other half for the test phase. Table 1.1 summarizes the datasets used in the thesis.
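The evaluation protocol of the second setting (random half split, repeated 10 times, averaged result) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the random seed, and the `evaluate` callback are hypothetical, and the callback stands in for training and testing an actual ReID method.

```python
import numpy as np

def random_half_splits(person_ids, n_trials=10, seed=0):
    """Yield (train_ids, test_ids) pairs with disjoint person identities."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(person_ids)
    half = len(ids) // 2
    for _ in range(n_trials):
        perm = rng.permutation(ids)        # new random order for every trial
        yield perm[:half], perm[half:]

def average_over_splits(person_ids, evaluate, n_trials=10, seed=0):
    """Average a score (e.g. a rank-1 matching rate) over all random splits.

    evaluate(train_ids, test_ids) is a hypothetical callback that trains on
    the first half and returns the score measured on the second half.
    """
    scores = [evaluate(train, test)
              for train, test in random_half_splits(person_ids, n_trials, seed)]
    return float(np.mean(scores))

# 178 persons as in the PRID-2011 setting described above: 89 for training
# and 89 for testing in each of the 10 trials.
splits = list(random_half_splits(range(178)))
```

Because the identities in the two halves are disjoint, no person seen during training is used for testing, which matches the second setting.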
Table 1.1: Benchmark datasets used in the thesis.
Figure 1.6: iLIDS-VID is captured by five non-overlapping cameras [36].
Figure 1.7: Some images from the five datasets used in this thesis: a) VIPeR, b) CAVIAR4REID, c) RAiD, d) PRID-2011, e) iLIDS-VID. For the first three datasets (VIPeR, CAVIAR4REID, and RAiD), images in the same column belong to the same person, while for the last two datasets, images in the same row belong to the same person.
1.2.2 Evaluation metrics
In order to evaluate the proposed methods for person ReID, Cumulative Matching Characteristic (CMC) curves [37] are used. For each query person, the gallery persons are ranked according to their similarity to the query. The value of the CMC curve at rank k is the ratio of the number of queries whose true match is found within the top k ranks to the total number of queried persons. The matching rates at several important ranks (1, 5, 10, 20) are usually used to evaluate the effectiveness of a method. Figure 1.8 illustrates CMC curves in which the curve of method#2 lies above that of method#1, meaning that method#2 performs better than method#1. The values in the legend of the curves are the accuracies at rank-1.
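The computation of a CMC curve from a query-to-gallery distance matrix can be sketched as follows. This is a minimal illustration, assuming a closed-set protocol in which each query identity appears exactly once in the gallery; the function name `cmc_curve` and the toy distance matrix are assumptions introduced for this example.

```python
import numpy as np

def cmc_curve(dist, query_ids, gallery_ids, max_rank=20):
    """Cumulative matching rates at ranks 1..max_rank.

    dist[q, g] is the distance between query q and gallery entry g
    (smaller means more similar). Assumes every query identity appears
    exactly once in the gallery.
    """
    dist = np.asarray(dist, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(dist, axis=1)              # gallery sorted per query
    ranked_ids = gallery_ids[order]               # shape (n_query, n_gallery)
    hits = ranked_ids == query_ids[:, None]
    first_hit = hits.argmax(axis=1)               # 0-based rank of the true match
    max_rank = min(max_rank, dist.shape[1])
    ranks = np.arange(max_rank)
    # A query counts at rank k if its true match is within the top k.
    return (first_hit[:, None] <= ranks[None, :]).mean(axis=0)

# Toy example: 3 queries against 3 gallery persons.
dist = [[0.1, 0.5, 0.9],    # true match ranked 1st
        [0.6, 0.7, 0.2],    # true match ranked 3rd
        [0.8, 0.3, 0.4]]    # true match ranked 2nd
cmc = cmc_curve(dist, query_ids=[0, 1, 2], gallery_ids=[0, 1, 2])
```

In this toy example the matching rates at rank-1, rank-2, and rank-3 are 1/3, 2/3, and 1, respectively; the curve is non-decreasing by construction.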
1.3 Feature extraction
The first crucial component of any pattern recognition problem is the feature extraction step. In order to describe a pedestrian image, biometric cues (eyes, iris, gait) or visual appearance are exploited. These are considered the most useful information for person representation. However, because images/videos in person ReID are usually captured at low resolution, information extracted from the eyes or iris is not sufficient for person representation. Besides, gait is a whole-body, behavioral biometric that is considered a characteristic of a pedestrian and has been studied for person ReID for a long time. However, it is not easy to extract human gait because of the complexity of realistic surveillance environments, such as airports, railway stations, and supermarkets. Additionally, human gait usually depends strongly on a person’s mood and health condition. Consequently, the majority of existing person ReID studies focus mainly on the visual appearance of pedestrians [8]. In general, features are classified into two main categories: hand-designed and deep-learned features. In the early days, hand-designed features were proposed for image representation. These features are built on the experience and perception of researchers [28, 9]. In 2014, with the rapid development of Convolutional Neural Networks (CNNs), the first studies applying deep-learned features to person ReID appeared. Since then, many works have paid attention to exploiting the capability of numerous deep networks. Features are also categorized into three levels of abstraction: low-, mid-, and high-level features [38]. Low-level features contain color, texture, and shape information extracted from every pixel, while more advanced low-level features are created by computing a covariance matrix from image derivatives [39] or by detecting local key points (SIFT [40]). Mid- and high-level features are constructed by learning from pixel-level features. For example, Bag of Words (BOW) models [41], which encode low-level features into visual words, are considered mid-level features, and deep-learned features extracted from the last layers of a CNN are high-level ones.
... information of a pixel, such as the pixel coordinates, its intensity, and the first and second derivatives at this pixel. In this way, each pixel is represented by a 7-dimensional vector, and these extracted feature vectors are then encoded into a Fisher Vector, called Local Descriptors encoded by Fisher Vector (LDFV), for person representation. Zhao et al. [43] solved person ReID by finding salient regions in a pedestrian image. In this work, salient regions are defined as the outstanding and easily recognizable cues for distinguishing different persons. Dense patches of 10 × 10 pixels are extracted with a step size of 5 pixels, and then a 32-dim LAB color histogram and a 128-dim SIFT descriptor are computed on each patch. Adjacency constrained search is exploited to seek the best match for a query patch within horizontal stripes at similar heights in the gallery images. In this way, higher scores are assigned to more prominent patches. These scores are the basis for calculating the similarity between two persons.
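The 7-dimensional pixel description and the dense patch grid mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the ordering of the seven components, the use of `numpy.gradient` as the derivative operator, and the function names are choices made here, not details taken from the cited works; the Fisher Vector encoding itself is not shown.

```python
import numpy as np

def pixel_features(gray):
    """Return an (H, W, 7) array with, for every pixel: x, y, intensity,
    first derivatives (Ix, Iy) and second derivatives (Ixx, Iyy)."""
    gray = np.asarray(gray, dtype=float)
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)   # pixel coordinates
    iy, ix = np.gradient(gray)                  # derivatives along rows, columns
    iyy = np.gradient(iy, axis=0)
    ixx = np.gradient(ix, axis=1)
    return np.stack([xs, ys, gray, ix, iy, ixx, iyy], axis=-1)

def dense_patch_origins(height, width, patch=10, step=5):
    """Top-left corners of all patch x patch windows on a regular grid."""
    return [(r, c)
            for r in range(0, height - patch + 1, step)
            for c in range(0, width - patch + 1, step)]

# A synthetic 128 x 48 intensity ramp stands in for a pedestrian image.
image = np.arange(128 * 48, dtype=float).reshape(128, 48)
feats = pixel_features(image)
origins = dense_patch_origins(128, 48)          # 10 x 10 patches, step 5
```

For a 128 × 48 image, a 10 × 10 patch with a step of 5 pixels gives 24 × 8 = 192 patches, each of which would then be described by its own histogram or descriptor.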
Considering the influence of illumination variations, the authors of [44] calculated color histograms in different color spaces and then combined these histograms to obtain a signature that is more robust to illumination variations. In this work, the authors claimed that the performance of person representation is still not satisfactory if it relies only on color histograms. Based on this analysis, they proposed a novel Salient Color Names based Color Descriptor (SCNCD). For the SCNCD descriptor, 16 colors in RGB space are used: fuchsia, blue, aqua, lime, yellow, red, purple, navy, teal, green, olive, maroon, black, gray, silver, and white. With salient color names, each color has a certain probability of being assigned to each of its nearest color names, and a closer color name receives a higher probability. In order to overcome the challenge caused by different viewpoints, Liao et al. [45] focused on the horizontal occurrence of local features and maximized it to build a robust descriptor, named LOMO. This descriptor includes color and Scale Invariant Local Ternary Pattern (SILTP) histograms [46]. Bins in the same horizontal stripe undergo max pooling, and a three-scale pyramid model is built before a log transformation is applied. The LOMO descriptor was later employed in several works [47, 48].
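The horizontal max pooling step described for LOMO can be sketched as follows. This is a simplified illustration, not the LOMO implementation: the function names and the toy histograms are assumptions, the three-scale pyramid is omitted, and `log1p` is used here instead of a plain logarithm only to avoid taking the logarithm of empty bins.

```python
import numpy as np

def horizontal_max_pool(window_hists):
    """Bin-wise maximum over all sub-windows in the same horizontal stripe.

    window_hists: shape (n_rows, n_cols, n_bins); one histogram per
    sub-window on a regular grid. Returns shape (n_rows, n_bins).
    """
    return np.asarray(window_hists, dtype=float).max(axis=1)

def pooled_descriptor(window_hists):
    """Concatenate the pooled stripes and compress large bin values."""
    pooled = horizontal_max_pool(window_hists)
    return np.log1p(pooled).ravel()

# Two stripes, three sub-windows per stripe, two histogram bins each.
hists = [[[1, 0], [0, 2], [3, 1]],
         [[0, 0], [1, 1], [0, 4]]]
pooled = horizontal_max_pool(hists)
descriptor = pooled_descriptor(hists)
```

Because the maximum is taken along the horizontal direction, a pattern contributes the same value regardless of where it occurs within the stripe, which is the intuition behind the robustness to viewpoint changes.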
In [39], the covariance descriptor is employed for image representation in person ReID. The aim of this descriptor is to encode information on the variances of features within a given image region, their correlations with each other, and their spatial location. The effectiveness of this descriptor is based on the invariance properties of covariance matrices, which help to cope with changes in illumination and rotation. An additional advantage of the covariance descriptor is that it can be computed on any kind of image, such as an intensity image or a color image. This descriptor is also used in the study of Matsukawa et al. [49]. In this study, the authors claimed that the mean and covariance are the most important cues for representing a person’s appearance. They proposed the Gaussian of Gaussian (GOG) descriptor, which inherits the advantages of covariance-based descriptors. This descriptor is constructed at three different levels: pixel, patch, and region. A Gaussian distribution is fitted twice, at the patch and region levels, hence the name Gaussian of Gaussian.
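A region covariance descriptor can be sketched in a few lines. This is a minimal illustration, assuming that a per-pixel feature map of shape (H, W, d) is already available; the function name and the random test region are assumptions, and the sketch covers neither the matching of covariance matrices nor the hierarchical Gaussian modelling of GOG.

```python
import numpy as np

def region_covariance(features):
    """Covariance matrix (d x d) of the d-dimensional per-pixel features
    inside a region given as an array of shape (H, W, d)."""
    f = np.asarray(features, dtype=float)
    samples = f.reshape(-1, f.shape[-1])      # one row per pixel
    return np.cov(samples, rowvar=False)      # columns are the variables

rng = np.random.default_rng(0)
region = rng.random((10, 10, 5))              # synthetic 5-dimensional features
cov = region_covariance(region)
cov_shifted = region_covariance(region + 0.3) # constant offset on every feature
```

The size of the descriptor depends only on the feature dimension d, not on the region size, and a constant offset added to the features (for example a uniform brightness shift) leaves the covariance matrix unchanged, which illustrates one of the invariance properties mentioned above.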
Different from the above studies, which directly exploit low-level color and texture features, some other works rely on attribute-based features, which can be considered a mid-level representation. Compared with low-level descriptors, attribute features are more invariant and robust to changes in person images captured by cross-view cameras. In [50], the authors introduced 15 binary attributes related to attire and soft biometrics for describing a person image, such as shorts, skirt, sandals, backpack, long hair, short hair, and gender. These attribute classifiers are trained on low-level color and texture features. This information is combined with visual features extracted by the SDALF method [14] to obtain higher ReID performance.
In [51], the authors embed the binary semantic attributes of the same person in cross-view cameras into a continuous low-rank attribute space, so that the attribute vector is more discriminative for matching. Shi et al. [52] introduced a framework to learn a semantic attribute model from existing fashion photography datasets. These attributes help person ReID achieve competitive results. Different from previous works, the authors take a generative modelling approach based on the Indian Buffet Process (IBP). This model has several advantages, including the simultaneous learning of all attributes and the ability to naturally exploit training data in both supervised and unsupervised manners. With great effort, Li et al. [53] built a large-scale dataset for pedestrian attribute recognition, namely Richly Annotated Pedestrian (RAP). This dataset is generated from real multi-camera surveillance scenarios with long-term collection and consists of 41,585 pedestrian samples annotated with 72 attributes. In particular, environmental and contextual factors are also considered in this dataset.
From the above analysis, some remarks on hand-designed features can be made as follows. Most existing studies have tried to build a descriptor that includes useful information about color, texture, or shape for person representation. To overcome the difficulties caused by variations in illumination, viewpoint, and pose, this information may be extracted at different scales/levels to form a descriptor that is not only robust but also discriminative. Besides, attribute-based features are also considered and exploited as information complementary to appearance. In fact, hand-designed features are usually built on the knowledge and experience of researchers. Therefore, a descriptor might be effective on some datasets but not on others. This is one of the disadvantages of hand-designed features. Additionally, because more and more information is extracted at different scales/levels, hand-designed features usually have a large dimension, for example approximately 26,960 dimensions for LOMO and 27,622 dimensions for the GOG descriptor. Such high-dimensional vectors also bring difficulties to storage and computation in the person matching step.
1.3.2 Deep-learned features
Recent years have witnessed the impressive results of Convolutional Neural Networks (CNNs) on pattern recognition tasks. For person ReID, CNNs and their variants are employed to achieve higher performance. In 2014, the first two studies on person ReID exploiting the capability of deep networks were introduced [54, 55]. In [54], the authors proposed a framework using a Siamese deep neural network, in which appearance-based features (color, texture) and the similarity metric are learned simultaneously. The Siamese network has a symmetrical structure consisting of two sub-networks that are connected by a cosine layer. Each sub-network comprises two convolutional layers and a fully connected layer. In this way, the similarity metric is learned directly from features extracted from image pixels. An input image is divided into three overlapping horizontal parts, which are then forwarded to the sub-networks, and the output of the Siamese network is the similarity of an image pair based on the cosine function.
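The part-based cosine comparison can be sketched as follows. This is only an illustration under stated assumptions and not the network of [54]: the `embed` argument is a hypothetical stand-in for a trained sub-network, the way the three overlapping parts are cut (each covering half of the image height) is chosen here for simplicity, and summing the part scores is an assumed fusion rule.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def overlapping_parts(image, n_parts=3):
    """Cut an image into n_parts horizontal parts, each covering half of
    the image height, so that neighbouring parts overlap (assumed layout)."""
    h = image.shape[0]
    part_h = h // 2
    starts = np.linspace(0, h - part_h, n_parts).astype(int)
    return [image[s:s + part_h] for s in starts]

def pair_similarity(img_a, img_b, embed):
    """Sum of the per-part cosine similarities of two images."""
    return sum(cosine_similarity(embed(pa), embed(pb))
               for pa, pb in zip(overlapping_parts(img_a),
                                 overlapping_parts(img_b)))

def flatten(part):
    """Toy embedding: the raw pixels of the part as a vector."""
    return part.ravel()

rng = np.random.default_rng(0)
img = rng.random((128, 48))                     # synthetic intensity image
same = pair_similarity(img, img, flatten)       # identical images
other = pair_similarity(img, rng.random((128, 48)), flatten)
```

With identical inputs, every part reaches a cosine similarity of 1, so the fused score equals the number of parts; in a trained Siamese network the embedding would be learned so that images of the same person obtain high scores and images of different persons obtain low ones.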
Figure 1.9: Proposed framework using Siamese Convolutional Neural Network (SCNN) [54]. a) Overall flowchart; b) The structure of SCNN.
Different from the above work, Li et al. [55] designed a new deep network with an architecture [...] photometric and geometric transforms, occlusion, and background clutter are jointly solved. Moreover, instead of employing hand-designed features, this deep network