MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
NGUYEN THUY BINH
This study was completed at:
Hanoi University of Science and Technology
Supervisors:
1. Assoc. Prof. Pham Ngoc Nam
2. Assoc. Prof. Le Thi Lan
Reviewer 1: Assoc. Prof. Tran Duc Tan
Reviewer 2: Assoc. Prof. Le Nhat Thang
Reviewer 3: Assoc. Prof. Ngo Quoc Tao
This dissertation will be defended before the approval committee
at Hanoi University of Science and Technology
at 9:00 on January 8, 2021.
This dissertation can be found at:
1. Ta Quang Buu Library, Hanoi University of Science and Technology
2. Vietnam National Library
Motivation
The development of image processing and pattern recognition allows building automatic video analysis systems. Such a system contains four crucial steps: person detection, tracking, person re-identification, and recognition. Person re-identification (ReID) is defined as the problem of associating images or image sequences of a pedestrian when he or she moves through a non-overlapping camera network [7]. Although some important milestones have been achieved, person ReID remains challenging. Person ReID is categorized into single-shot and multi-shot approaches. In the single-shot approach, each person has only one image in both the gallery and probe sets; conversely, each person has multiple images in the multi-shot approach. Note that the probe and gallery sets are captured by different cameras. The main challenges come from (1) variations in illumination, view-points, poses, etc., (2) the large number of images for each person in a camera view and the large number of persons, and (3) the effect of human detection and tracking results.
Objective
The thesis has three main objectives as follows:
Fusion schemes are proposed for both settings of multi-shot person ReID. Besides equal weights, feature weights are adaptively determined for each query based on the query characteristics.
Chapter 3 presents fusion schemes for person ReID. Chapter 4 presents a fully automated person ReID system including human detection, tracking, and person ReID. The effect of the human detection and segmentation steps on the overall performance is evaluated.
1.1 Datasets and evaluation metrics
Table 1.1 Benchmark datasets used in the thesis

Dataset       Time  #ID  #Cam  #Images  Label  Full frames  Resolution  Single-shot  Multi-shot  Setting
VIPeR         2007  632  2     1,264    hand   -            128x48      X            -           2
CAVIAR4REID   2011  72   2     1,220    hand   -            vary        -            X           1
PRID-2011     2011  934  2     24,541   hand   +            128x65      X            X           2
iLIDS-VID     2016  300  2     42,495   hand   -            vary        -            X           2
Five benchmark datasets, namely VIPeR, CAVIAR4REID, RAiD, PRID-2011, and iLIDS-VID, are used for performance evaluation of the proposed methods in this thesis. Among these five datasets, CAVIAR4REID and RAiD are set up following the first setting, while the three remaining datasets follow the second setting.
Metric learning methods seek a sub-space in which the projected feature vectors satisfy the above-mentioned conditions.
1.4 Fusion schemes for person ReID
The matching scores obtained from individual features are combined by a fusion function to get the final score.
1.5 Representative frame selection
Most existing works use all frames in a sequence for person representation [6, 16, 24].
1.6 Fully automated person ReID systems
A fully automated person ReID system has three main phases: human detection, tracking, and person ReID.
CHAPTER 2 REPRESENTATIVE FRAMES SELECTION AND TEMPORAL FEATURE POOLING
The proposed framework consists of four main steps: representative frame selection, image-level feature extraction, temporal feature pooling, and person matching. The first step aims to determine the representative frames used for person representation. Three strategies of representative frame selection are introduced in this work: four key frames, a walking cycle, and all images. Once the frames used for person representation are determined, Gaussian of Gaussian (GOG) descriptors [18] are extracted from them and pooled over time. The Cross-view Quadratic Discriminant Analysis (XQDA) [14] technique is performed at the final step to compute the matched individuals for each given probe person.
Figure 2.1 The proposed framework consists of four main steps: representative image selection, image-level feature extraction, temporal feature pooling and person matching
Algorithm 2.1 is performed off-line in the training phase, while Algorithm 2.2 is conducted on-line in the test phase.
Firstly, a representative walking cycle is chosen from the set of walking cycles of a person along the moving path based on the Flow Energy Profile (FEP) [21]. Secondly, four key frames are taken from this walking cycle: two frames corresponding to the local minimum and maximum points of the FEP, and two middle frames between the max- and min-frames.
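To make this step concrete, below is a minimal Python sketch of the four-key-frame selection, assuming the FEP has already been computed as a one-dimensional signal over the frames of one walking cycle; the function name and the wrap-around choice for the second middle frame are illustrative, not the thesis implementation.

import numpy as np

def select_key_frames(fep):
    """Select four key frames from the FEP signal of one walking cycle:
    the min-frame, the max-frame, and the two middle frames between them
    (one inside the [min, max] interval, one on the wrap-around side)."""
    fep = np.asarray(fep, dtype=float)
    i_min = int(np.argmin(fep))              # local minimum of flow energy
    i_max = int(np.argmax(fep))              # local maximum of flow energy
    lo, hi = sorted((i_min, i_max))
    n = len(fep)
    mid_inside = (lo + hi) // 2              # middle frame between min and max
    mid_outside = (hi + (n - (hi - lo)) // 2) % n   # middle frame on the other side
    return sorted({i_min, i_max, mid_inside, mid_outside})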
Algorithm 2.1: Algorithm for the training phase (off-line process).
Input: image sequences from cross-view cameras: \(X = \{X_i\},\ i = 1, \dots, N_{tr}\); \(Z = \{Z_j\},\ j = 1, \dots, N_{tr}\), where \(N_{tr}\) is the number of persons used for training.
Output: model parameters \(W, M\).
Algorithm 2.2: Algorithm for the test phase (on-line process).
Input: a query person \(Q_i\); the image sequences of the gallery set; the parameters of the trained model \(W, M\).
The schemes are compared in terms of accuracy as well as occupied memory. Three pooling strategies, including min-, average-, and max-pooling across all video frames, are applied in this work.
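As an illustration of this step, the following NumPy sketch aggregates a stack of image-level feature vectors into a single sequence-level vector; the 7,567-dimensional size in the usage example is only indicative of a GOG-like descriptor, not a value fixed by the thesis.

import numpy as np

def temporal_pool(frame_features, mode="max"):
    """Pool a (num_frames, feat_dim) matrix of image-level features
    into one sequence-level vector by min-, average-, or max-pooling."""
    F = np.asarray(frame_features, dtype=float)
    if mode == "min":
        return F.min(axis=0)
    if mode == "avg":
        return F.mean(axis=0)
    if mode == "max":
        return F.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

# e.g. pool the features of four key frames into one vector
feats = np.random.rand(4, 7567)          # illustrative GOG-like dimensionality
sequence_feature = temporal_pool(feats, mode="avg")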
2.2.5 Person matching
The XQDA technique is an extended version of the Bayesian face and Keep It Simple and Straightforward MEtric (KISSME) [11] algorithms, in which the multi-class classification problem is cast as a binary (same/different person) one, keeping the computational time low.
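While the learning of W and M follows [14] and is beyond this summary, a hedged sketch of how a trained XQDA model (the W and M of Algorithms 2.1 and 2.2) would be applied at test time looks as follows; the variable names are illustrative, and this is a sketch rather than reference code.

import numpy as np

def xqda_distance(x, z, W, M):
    """Distance between a probe feature x and a gallery feature z under
    a learned XQDA model: project both with W, then evaluate the
    quadratic form with the metric kernel M."""
    d = W.T @ x - W.T @ z        # difference vector in the learned subspace
    return float(d @ M @ d)

def rank_gallery(query, gallery, W, M):
    """Return gallery indices sorted by increasing XQDA distance."""
    dists = [xqda_distance(query, g, W, M) for g in gallery]
    return np.argsort(dists)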
Table 2.1 shows the comparison of the three schemes in terms of person ReID accuracy, computational time, and memory requirement on the PRID-2011 dataset. The memory occupied by the four key frames, one walking cycle, and all frames schemes is 96 KB, 312 KB, and 2,400 KB, respectively.
The accuracies at rank-5 of using four key frames, one walking cycle, and all frames are 94.70%, 94.99%, and 97.98%, respectively, while those at rank-10 are 97.93%, 97.92%, and 99.55%.
Table 2.1 Comparison of the three frame-selection schemes in terms of accuracy at rank-1, computational time, and memory requirement on the PRID 2011 dataset

Method           Accuracy at rank-1 (%)   Computational time for each person (s)                        Memory
                                          Frame selection  Feature extraction  Pooling  Matching  Total
Four key frames  77.19                    7.500            3.960               0.024    0.004     11.488    96 KB
Walking cycle    79.10                    7.500            12.868              0.084    0.004     20.452    312 KB
All frames       90.56                    0.000            98.988              1.931    0.004     100.919   2,400 KB
Table 2.2 shows the comparison between the results of the proposed framework and the state-of-the-art methods; the two best results are in bold. The table shows that the proposed method outperforms all state-of-the-art methods at rank-1, even in comparison with deep learning-based approaches.
Table 2.2 Comparison between the proposed method and existing works on PRID 2011 and iLIDS-VID datasets. The two best results are in bold.

                                 PRID 2011                  iLIDS-VID
Matching rates (%)               Rank=1  Rank=5  Rank=20    Rank=1  Rank=5  Rank=20
STFV3D + KISSME, ICCV 2015       64.1    87.3    92.0       44.3    71.7    91.7
At higher ranks, the proposed method obtains competitive results in comparison with the STFV3D method [16]. Considering the three frame-selection schemes, using all frames brings the highest accuracy but requires high-cost computation and a large amount of memory, and this becomes a serious problem when dealing with a large-scale dataset.
2.4 Conclusions and Future work
This work is based on the assumption that a person stays in the field of view of a camera for a certain time duration. In reality, this assumption does not always hold. In the future, the proposed framework will be extended to relax this assumption.
CHAPTER 3 PERSON REID BASED ON FUSION SCHEMES
3.1 Introduction
This chapter shows that person ReID accuracy can still be improved through fusion schemes. Both kinds of features, hand-designed and deep-learned, are used for image representation. For hand-designed features, GOG [18] and the Kernel Descriptor (KDES) [1] are considered, while for deep-learned features two of the strongest convolutional neural networks, GoogLeNet and the Residual Neural Network (ResNet), are employed.
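As an example of how such deep-learned features can be obtained in practice, the sketch below extracts a 2048-D embedding from an ImageNet-pretrained ResNet-50 via torchvision; the input size and the choice of the pre-classifier layer are assumptions made for illustration, since the summary does not fix them.

import torch
import torchvision.models as models
import torchvision.transforms as T

# ResNet-50 with the final classifier removed, so a forward
# pass returns the 2048-D pooled feature instead of class scores.
resnet = models.resnet50(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize((256, 128)),                     # typical pedestrian crop size (assumption)
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def deep_feature(pil_image):
    x = preprocess(pil_image).unsqueeze(0)    # (1, 3, 256, 128)
    return resnet(x).squeeze(0)               # (2048,) embedding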
Multi-shot person ReID can be divided further into two sub-categories: image-to-images and images-to-images.
3.2.1.1 The proposed framework
3.2.1.2 Feature fusion strategies
Trang 14based late fusion Matching
Product-rule-and ranking
ID person
Extracting CNN feature
Extracting KDES feature
Extracting CNN feature
SVM Prediction Early fusion
Model
Query-adaptive late fusion
Extracting GOG feature
Extracting GOG feature
Training phase
Testing phase
Figure 3.1 Image-to-images person ReID scheme
The matching scores are fused following the two common rules, namely the sum-rule and the product-rule. Besides equal weights, inspired by query-adaptive late fusion, the feature weights are also determined adaptively for each query:
- Product-rule with equal weights: \(S(q,g) = \prod_{m=1}^{M} s_m(q,g)\)
- Product-rule with query-adaptive weights: \(S(q,g) = \prod_{m=1}^{M} s_m(q,g)^{w_m(q)}\), where \(s_m(q,g)\) denotes the matching score between query \(q\) and gallery person \(g\) under the \(m\)-th feature, and \(w_m(q)\) is its query-adaptive weight.
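A compact sketch of the two product-rule variants follows, assuming the per-feature scores \(s_m(q,g)\) over the gallery are already normalized to (0, 1]; the computation of the weights themselves follows the thesis's query-adaptive procedure and is taken as given here.

import numpy as np

def product_rule(scores, weights=None):
    """Fuse M per-feature score vectors over the gallery.
    scores  : (M, num_gallery) array of normalized scores in (0, 1].
    weights : optional per-feature weights w_m(q); equal weights when None."""
    S = np.asarray(scores, dtype=float)
    if weights is None:
        return S.prod(axis=0)                    # equal-weight product rule
    w = np.asarray(weights, dtype=float)[:, None]
    return np.prod(S ** w, axis=0)               # query-adaptive weighted product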
Figure 3.2 shows the proposed framework for images-to-images person ReID. In this framework, the temporal linking between images of the same person is not required, and these images are treated independently. The images-to-images problem is formulated as a fusion of image-to-images matching results.
Figure 3.2 Images-to-images person ReID scheme: image-to-images ReID for each query image, late fusion based on the product rule, then matching and ranking of the person ID
3.2.3 Obtained results on the first setting
For the first setting, two benchmark datasets, CAVIAR4REID and RAiD, are used for evaluation. The fusion schemes improve matching rates by 2% to 5% compared to those obtained when using only KDES and CNN features.
3.2.3.2 Images-to-images person ReID
By applying the product-rule strategy, image-to-images person ReID is mapped into the images-to-images one, as sketched below. Figure 3.5 shows the performance of images-to-images person ReID on the CAVIAR4REID dataset.
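A minimal sketch of this mapping, where match_one is an assumed image-to-images matcher returning normalized gallery scores for a single query image:

import numpy as np

def images_to_images_scores(query_images, match_one, num_gallery):
    """Fuse image-to-images results over all images of the query person
    with the product rule to obtain images-to-images scores."""
    fused = np.ones(num_gallery)
    for img in query_images:
        fused *= match_one(img)   # (num_gallery,) normalized scores per image
    return fused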
Figure 3.3 Evaluation of the performance of the three chosen features (GOG, KDES, CNN) over 10 trials on (a) CAVIAR4REID case A, (b) CAVIAR4REID case B, and (c) RAiD in the image-to-images case. Rank-1 rates: (a) GOG 67.47%, KDES 65.50%, CNN 62.64%; (b) GOG 82.83%, KDES 81.19%, CNN 82.89%; (c) GOG 84.86%, KDES 81.60%, CNN 84.79%.
Figure 3.4 Comparison of the performance of the three fusion schemes when using two or three features over 10 trials on (a) CAVIAR4REID case A, (b) CAVIAR4REID case B, and (c) RAiD in the image-to-images case. Rank-1 rates range from 67.31% to 73.61% on case A, from 86.97% to 90.33% on case B, and from 86.85% to 89.29% on RAiD, all clearly above the SDALF baseline (37.69%, 49.97%, and 59.63%, respectively).
Figure 3.5 CMC curves in case A of images-to-images person ReID on the CAVIAR4REID dataset. Rank-1 rates: SDALF 67.50%; GOG+SVM 91.53%; KDES+SVM 91.39%; CNN+SVM 88.06%; early fusion 94.44%; product-rule 93.89%; query-adaptive 94.31%.
Table 3.2 Comparison of images-to-images and image-to-images schemes at rank-1 (*)
Figure 3.6 The proposed method for video-based person ReID by combining the fusion scheme with the metric learning technique
Here, the ResNet feature provides a higher performance compared to the GOG descriptor, and the matching rates at rank-1 are improved by 13.1%, 13.68%, and 14.13%, respectively. This can be explained by the fact that a deep structure such as ResNet can learn the complex background and extract useful information for person representation. The above-mentioned experimental results are compared with several existing works in Table 3.3.
[Figure: CMC curves of GOG, ResNet, and the product-rule query-adaptive fusion on PRID-2011 and iLIDS-VID when using a) four key frames, b) frames within a walking cycle, c) all frames. With all frames, the fusion reaches 91.46% rank-1 on PRID-2011 and 80.73% on iLIDS-VID, versus 80.56% and 67.67% for ResNet alone.]
Table 3.3 Comparison between the proposed method and existing works on PRID 2011 and iLIDS-VID datasets. The two best results are in bold.
The best matching rates at rank-1 of this method are 93.3% and 82.0%, higher by 1.8% and 0.2% than those of the proposed framework on PRID-2011 and iLIDS-VID, respectively; however, the CFFM method has to incorporate both CNN and RNN combined with multiple attention networks.
3.4 Conclusions
This chapter proposes several feature fusion schemes for person ReID in both settings. For the first setting, fusing hand-designed and deep-learned features consistently improves the matching rates over the individual features.
CHAPTER 4 QUANTITATIVE EVALUATION OF AN END-TO-END
PERSON REID PIPELINE
Then, person bounding boxes within a camera field of view (FoV) are connected through the person tracking step. Finally, person ReID associates images of the same person when he/she moves from one camera FoV to the other ones. It is worth noting that in some surveillance systems, person segmentation and person detection are coupled.
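The overall flow can be summarized by the short sketch below, in which detector, tracker, and matcher stand for the concrete components discussed next (e.g., ACF/YOLO/Mask R-CNN for detection and GOG + XQDA for matching); all names are placeholders rather than the thesis's actual interfaces.

def end_to_end_reid(frames, detector, tracker, matcher, gallery):
    """Fully automatic pipeline sketch: detect people in each frame,
    link the boxes into per-person tracklets within the camera FoV,
    then match every tracklet against the cross-camera gallery."""
    detections = [detector(frame) for frame in frames]  # person bounding boxes
    tracklets = tracker(detections)                     # {track_id: image sequence}
    return {track_id: matcher(images, gallery)          # ranked gallery identities
            for track_id, images in tracklets.items()}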
Figure 4.1 The proposed framework for a fully automatic person ReID system
4.2.1 Pedestrian detection
Concerning person detection, three state-of-the-art detection techniques, namely Aggregate Channel Features (ACF) [3], You Only Look Once (YOLO) [20], and Mask R-CNN [10], are considered. For person segmentation, the Pedparsing [17] method is used thanks to its effectiveness on pedestrian images. In order to bring person ReID to practical applications, the GOG descriptor is implemented in C++ and its optimal parameters are selected through intensive experiments. The experimental results show that the proposed approach allows the feature extraction time to be reduced considerably.
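Of the three detectors, Mask R-CNN is available off the shelf in torchvision; the following hedged sketch extracts person boxes with it (the detector configurations actually used in the thesis may differ).

import torch
import torchvision

# COCO-pretrained Mask R-CNN; label 1 corresponds to 'person' in COCO.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

@torch.no_grad()
def detect_persons(image_tensor, score_thresh=0.8):
    """image_tensor: (3, H, W) float tensor in [0, 1].
    Returns the (N, 4) bounding boxes of detected persons."""
    output = model([image_tensor])[0]
    keep = (output["labels"] == 1) & (output["scores"] > score_thresh)
    return output["boxes"][keep]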
[Figure: CMC curves on PRID 2011 with automatic detection; rank-1 rates: 50.67% auto-detection, 42.14% auto-detection (ACF) + auto-segmentation, 37.87% auto-detection (YOLO) + auto-segmentation.]
The ACF detector gives better performance than the YOLO one in both cases (without/with segmentation). In addition, with the proposed method, the auto-detection + segmentation scenario reaches a rank-1 matching rate of 88.76%.
Figure 4.3 CMC curves of three evaluated scenarios on PRID 2011 dataset when applying
the proposed method in Chapter 2
Figure 4.3 indicates the matching rates on the PRID 2011 dataset when employing the GOG descriptor in the three evaluated scenarios.