Attention in Crowd Counting Using the Transformer and Density Map to Improve Counting Result
Phuc Thinh Do
Dong Nai Technology University
Dong Nai, Vietnam
dophucthinh@dntu.edu.vn
Abstract— With the vigorous development of CNNs, most crowd counting methods estimate a density map with a CNN and then infer the count from it. However, these methods face many limitations, such as limited receptive fields and background noise. With the advent of the Transformer in natural language processing, it has become possible to apply this model to the crowd counting problem. The Transformer can model the global context, which helps solve the problem of limited receptive fields. On the other hand, with the attention mechanism, the model can focus on areas where people are concentrated, helping to suppress background noise. In this paper, we propose a crowd counting model combining a Transformer and a density map (TDCrowd) to estimate the number of people in a crowd. Thanks to the Transformer, TDCrowd can also be trained without information about the location of people in the crowd, using only the count. Experiments on three datasets, ShanghaiTech, UCF-QNRF, and JHU-Crowd++, show that TDCrowd gives better results than both regression-based methods (which need only the count) and density map-based methods (which need both the count and location information).
Keywords— crowd counting, convolutional neural networks,
density map, Transformer, attention
I. INTRODUCTION
Crowd counting refers to estimating the number of objects in a crowd, such as people, vehicles, or trees. It is one of the essential tasks in surveillance systems. The original approach to crowd counting was to detect objects and count the number of detections. Since object detection performs poorly in images with dense crowds, a new family of methods arose: regression methods. Methods of this type attempt to learn a mapping between the crowd image and the count. Recently, with the development of deep learning, crowd counting has moved in a new direction, using density maps (Fig. 1). This approach takes advantage of spatial information by exploiting the positions of objects in the image. However, density map-based methods face problems such as limited receptive fields and background noise. With the advent of the Transformer [26] in natural language processing, many works have adapted this model to image processing [2], [7]. The advantage of the Transformer is that it can capture global information, which addresses the limited receptive fields of CNNs in general and of density map-based methods in particular. In this paper, we propose a combined model of Transformer and density map (TDCrowd). This model exploits the attention mechanism to focus on crowded areas, thereby reducing background noise. On the other hand, with its ability to gather global information, the model overcomes the problem of limited receptive fields.
In summary, we propose a model that uses the Transformer to generate density maps instead of only estimating counts. This approach allows our model to capture the global context, thus solving the problem of limited receptive fields of CNNs. On the other hand, we can leverage location information to improve the density map, because a density map can be generated from the head positions. The rest of the paper is organized as follows. In Section II, we discuss crowd counting approaches and related research. Next, we present the proposed model, the baseline, and how to train the model. In Section IV, we describe the experiments and analyze the results. Finally, we conclude and outline future directions in the last section.
Fig. 1. Sample images from the JHU-Crowd++ dataset and their density maps.
II. RELATED WORK
Previous crowd counting methods often follow the direct counting approach or use regression models. Direct counting methods use a sliding window over the image to detect objects [11], [29]. Several works use CNNs to build a detection model that predicts bounding boxes [17], [21]; the number of bounding boxes is then the number of people in the image. However, in scenes with too many people, these models have difficulty identifying individual objects. Another approach is to use a regression model [3], [9], [22]. Methods of this type build a regression model mapping the crowd image to the count. However, this makes the model's results harder to interpret. Moreover, since only a single count is used, the model lacks spatial information. Lempitsky [15] proposed mapping the input image to a density map that can be integrated to obtain the final count. Density maps take advantage of spatial information, namely the locations of people in the image. With the development of CNNs, many methods using CNNs to generate density maps have been proposed. MCNN [31] uses multi-column CNNs with different filter sizes to extract features at different scales. Switch-CNN [23] improves MCNN by using an additional VGG-16 classifier to select the appropriate CNN column, while Do et al. [4], [5] focus on removing non-human scenes. CSRNet [16] uses dilated convolutions to increase receptive fields. CHF [6] uses filters to improve density map quality.
Recently, many works have used the Transformer for image processing and achieved good results. DETR [2] and ViT [7] are among the first works to apply the Transformer to object detection and recognition. Inspired by ViT, Liang et al. [18] proposed TransCrowd, one of the first methods to use Transformers for crowd counting. The input image is converted to sequence data and mapped to a count. This method can collect global information but ignores information about the locations of people in the image.
III. PROPOSED METHOD
Our proposed model uses the Transformer [26] to build the density map. The sum of the pixel values of the density map represents the number of people. Inspired by ViT [7], our model uses the Transformer encoder to extract information. However, instead of mapping directly to the count, the output of the Transformer encoder is convolved with a 1×1 convolution to generate a density map. This method helps solve the background noise problem because the attention mechanism focuses attention on areas where people are present. Furthermore, the Transformer can receive global information, which helps solve the problem of limited receptive fields. The model can be trained in two ways: with position information or with only count information. If the data contain information about the locations of people in the image, the model is trained using the L2 loss between the estimated density map and the ground truth density map. On the other hand, if only the count information is available, the model computes the count from the density map estimated after the 1×1 convolution [31] and uses the L1 loss between the predicted count and the ground truth count. The proposed model is depicted in Fig. 2.
A. Input image processing
Because the input of the Transformer [26] is sequence data, we convert the input image into equal parts. Given an input image of size $h \times w \times 3$ (where $w$ is the width and $h$ is the height), we divide it into patches of size $p \times p \times 3$. The number of patches obtained is $N = hw/p^2$. We then stack these patches into a sequence $x \in \mathbb{R}^{N \times (p^2 \cdot 3)}$ (Fig. 2). To convert the sequence into a latent $D$-dimensional embedding, we use a learnable projection $f: x \rightarrow \{e_1, e_2, \ldots, e_N\}$, where each $e_i$ has size $D$. To maintain the position information, we add a learnable position embedding $\{pos_1, pos_2, \ldots, pos_N\}$.
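To make the patching step concrete, the following is a minimal PyTorch sketch of the patch embedding described above (the image size, patch size p = 16, and embedding dimension D = 768 are illustrative assumptions, not values reported here; a strided convolution implements the flatten-and-project operation):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Splits an h x w x 3 image into p x p patches, projects each patch
    # to a D-dimensional embedding, and adds position embeddings.
    def __init__(self, img_size=384, patch_size=16, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2      # N = hw / p^2
        # Equivalent to flattening each p x p x 3 patch and applying
        # the learnable projection f.
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                              stride=patch_size)
        # Learnable position embeddings pos_1, ..., pos_N.
        self.pos = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                  # x: (B, 3, h, w)
        x = self.proj(x)                   # (B, D, h/p, w/p)
        x = x.flatten(2).transpose(1, 2)   # (B, N, D)
        return x + self.pos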
B. Transformer encoder
The Transformer encoder [26] consists of $L$ blocks, each containing a multi-head self-attention (MSA) module and a multilayer perceptron (MLP). In every block, layer normalization (LN) and residual connections are applied. The MLP contains two layers with a GELU activation function [8]: the first layer expands the embedding dimension from $D$ to $4D$, and the second compresses it from $4D$ back to $D$. The output $z_\ell$ of the Transformer encoder is computed as follows:

$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1, 2, \ldots, L$

$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1, 2, \ldots, L$

MSA consists of $m$ independent self-attention (SA) modules and a re-projection operation. The input of each SA module contains three components, query ($Q$), key ($K$), and value ($V$):

$Q = z_{\ell-1} W_Q, \quad K = z_{\ell-1} W_K, \quad V = z_{\ell-1} W_V$

where $W_Q$, $W_K$, $W_V$ are three learnable matrices. The output of SA applies the softmax function to the scaled query-key products:

$\mathrm{SA}(z_{\ell-1}) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d}\right)V$

Different from ViT [7], the desired output of the model is a density map. Therefore, we add a $1 \times 1$ convolution layer to regress the density map (Fig. 2).
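As an illustration of the encoder equations and the density regression head, here is a minimal PyTorch sketch (D = 768 and m = 12 heads are illustrative assumptions; torch.nn.MultiheadAttention stands in for the MSA module with its W_Q, W_K, W_V projections and re-projection):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # One encoder block: z' = MSA(LN(z)) + z, then z = MLP(LN(z')) + z'.
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(          # expands D -> 4D, then 4D -> D
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]
        return z + self.mlp(self.ln2(z))

class DensityHead(nn.Module):
    # Reshapes the token sequence back to a 2-D grid and applies the
    # 1 x 1 convolution that regresses the density map.
    def __init__(self, dim=768):
        super().__init__()
        self.conv = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, z, gh, gw):          # z: (B, N, D), N = gh * gw
        x = z.transpose(1, 2).reshape(z.size(0), -1, gh, gw)
        return self.conv(x)                # (B, 1, gh, gw) density map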
Fig. 2. The proposed model in the training stage.
Fig. 3. The proposed model in the testing stage.
C. The ground truth density map
For datasets with location information, we create the ground truth density map needed for training by applying a Gaussian kernel at each head position:

$D^{gt}(x) = \sum_{i=1}^{n} \delta(x - x_i) * G_\sigma(x)$

where $D^{gt}$ is the ground truth density map, $G_\sigma$ is a Gaussian kernel with standard deviation $\sigma$, $x_i$ is the $i$-th head position, and $n$ is the number of head positions. The final count is obtained by summing the values of the density map. Similar to previous methods using density maps, we choose $\sigma = 15$.
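A ground-truth density map of this form can be generated with a short NumPy/SciPy routine (a sketch under the fixed-sigma assumption above; gaussian_filter applies the kernel G_sigma, and since the filter preserves total mass, the map still sums to n):

import numpy as np
from scipy.ndimage import gaussian_filter

def make_gt_density_map(head_points, h, w, sigma=15):
    # Place a unit impulse at each annotated head position (x, y),
    # then convolve with a Gaussian of standard deviation sigma.
    dm = np.zeros((h, w), dtype=np.float32)
    for x, y in head_points:
        row = min(max(int(y), 0), h - 1)
        col = min(max(int(x), 0), w - 1)
        dm[row, col] += 1.0
    return gaussian_filter(dm, sigma)      # dm.sum() stays close to n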
D. Loss function
After obtaining the density map via the $1 \times 1$ convolution, the model is trained using the L2 loss function, which measures the difference between the estimated density map and the ground truth density map:

$L_2(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| D_i(\Theta) - D_i^{gt} \right\|_2^2$

where $N$ is the number of images, $D_i^{gt}$ is the ground truth density map of the $i$-th image, and $D_i(\Theta)$ is the estimated density map of the $i$-th image with parameters $\Theta$.
However, thanks to the use of the Transformer [26], the proposed model can also be trained using the L1 loss function, which measures the difference between the predicted count and the ground truth count:

$L_1(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left| C_i(\Theta) - C_i^{gt} \right|$

where $N$ is the number of images, $C_i^{gt}$ is the ground truth count of the $i$-th image, and $C_i(\Theta)$ is the predicted count of the $i$-th image with parameters $\Theta$.
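Both objectives map directly onto PyTorch's built-in losses, as in the following sketch (PyTorch's mean-reduced losses differ from the formulas above only by constant scaling, which can be folded into the learning rate):

import torch
import torch.nn.functional as F

def counting_loss(est_dm, target, has_locations):
    # est_dm: (B, 1, H, W) estimated density map.
    # target: ground-truth density map of the same shape, or a (B,)
    # tensor of ground-truth counts when only counts are annotated.
    if has_locations:
        return F.mse_loss(est_dm, target)   # L2 objective (up to scale)
    counts = est_dm.sum(dim=(1, 2, 3))      # integrate the map -> count
    return F.l1_loss(counts, target)        # L1 objective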
The details of the algorithm used to build and train the model are shown in Fig. 4.
Algorithm: Training
Input: Annotated images with ground truth density map dm (flag = 1) or ground truth count gtc (flag = 0)
Output: Trained model

Begin
    optimizer = Adam(lr=1e-5)
    if flag == 1:  # with the position information
        model = Sequential(VisionTransformer(), Conv2d(kernel_size=1))
    if flag == 0:  # with the count information only
        model = VisionTransformer()
    for image in dataset:
        if flag == 1:
            est = model(image)            # estimated density map
            l2loss = MSELoss(est, dm)
            optimizer.zero_grad()         # optimized using the PyTorch library
            l2loss.backward()
            optimizer.step()
        if flag == 0:
            est = model(image)
            count = est.sum()             # calculate the count
            l1loss = L1Loss(count, gtc)
            optimizer.zero_grad()
            l1loss.backward()
            optimizer.step()
End
Fig. 4. Model training algorithm.
IV. EXPERIMENTS
We evaluate our method on three datasets: ShanghaiTech, UCF-QNRF, and JHU-Crowd++. The model is implemented in Python using the PyTorch library. When processing the input image, we choose n = 3 [4]. To increase the training data, we use transformations such as rotation, flipping, and random cropping. Due to memory limitations, we limit the image size to 1024 and use sliding windows when dealing with the UCF-QNRF and JHU-Crowd++ datasets. We use the Adam optimizer [13] with a learning rate of 1e-5 and a weight decay of 1e-4. The evaluation metric and experimental results are shown below.
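The sliding-window inference on oversized images could be sketched as follows (a sketch only, assuming non-overlapping windows and image sides that are multiples of the window size; the paper does not specify the exact scheme):

import torch

@torch.no_grad()
def sliding_window_count(model, image, win=1024):
    # image: (3, h, w) tensor; the count of each window's density map
    # is accumulated, since a map's sum equals its window's count.
    _, h, w = image.shape
    total = 0.0
    for top in range(0, h, win):
        for left in range(0, w, win):
            patch = image[:, top:top + win, left:left + win].unsqueeze(0)
            total += model(patch).sum().item()
    return total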
A. Evaluation metric
For comparison with previous methods, we use two evaluation metrics: Mean Absolute Error (MAE) and Mean Squared Error (MSE):

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{gt} \right|$

$\mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{gt} \right)^2}$

where $N$ is the number of images, $C_i$ is the estimated count, and $C_i^{gt}$ is the ground truth count. MAE indicates the accuracy of the predicted result, and MSE measures the robustness.
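Evaluated over a test set, these two metrics can be computed as follows (a small NumPy sketch; note that, following the crowd counting convention used above, MSE denotes the root of the mean squared error):

import numpy as np

def evaluate(pred_counts, gt_counts):
    # pred_counts, gt_counts: sequences of per-image counts.
    pred = np.asarray(pred_counts, dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    mae = np.abs(pred - gt).mean()                 # accuracy
    mse = np.sqrt(((pred - gt) ** 2).mean())       # robustness
    return mae, mse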
B. ShanghaiTech dataset
This dataset [31] is divided into Part A and Part B. Part A includes 300 training images and 182 testing images crawled from the internet. Part B consists of 400 training images and 316 testing images; images in Part B are taken in the metropolis of Shanghai. In total, the dataset includes 1,198 images with 330,165 annotations. As TABLE I shows, TDCrowd gives better results than other methods, especially for images with a high density of people.
TABLE I. COMPARISON WITH OTHER METHODS ON THE SHANGHAITECH DATASET

Method            Part A MAE  Part A MSE  Part B MAE  Part B MSE
Switch-CNN [23]   90.4        135.0       21.6        33.4
Do et al. [4]     81.9        122.1       20.9        33.1
TransCrowd [18]   66.1        105.1       9.3         16.1
Fig. 5. Visualization results of our model on the ShanghaiTech Part A dataset. The left column is the sample image, and the right column is the estimated density map.
Fig. 6. Visualization results of our model on the ShanghaiTech Part B dataset. The left column is the sample image, and the right column is the estimated density map.
C. UCF-QNRF dataset
This dataset [10] includes 1,535 images with 1.25 million annotations, split into 1,201 training images and 334 testing images. The dataset is intended for crowd counting and localization and contains realistic scenarios captured in the wild. The number of people per image ranges from 49 to 12,865. On this large-scale dataset, TDCrowd outperforms the state-of-the-art methods: it reduces the MAE of TransCrowd from 97.2 to 83.0 and the MSE from 168.5 to 143.4 (TABLE II).
TABLE II. COMPARISON WITH OTHER METHODS ON THE UCF-QNRF DATASET

Method              MAE    MSE
Idrees et al. [10]  132.0  191.0
TransCrowd [18]     97.2   168.5
TDCrowd (ours)      83.0   143.4
Fig. 7. Visualization results of our model on the UCF-QNRF dataset. The left column is the sample image, and the right column is the estimated density map.
D. JHU-Crowd++ dataset
Fig. 8. Visualization results of our model on the JHU-Crowd++ dataset. The left column is the sample image, and the right column is the estimated density map.
TABLE III. COMPARISON WITH OTHER METHODS ON THE JHU-CROWD++ DATASET

Method           Val MAE  Val MSE  Test MAE  Test MSE
MCNN [31]        160.6    377.7    188.9     483.4
CSRNet [16]      72.2     249.9    85.9      309.2
CAN [20]         89.5     239.3    100.1     314.0
SANet [1]        82.1     272.6    91.1      320.4
CG-DRCN [25]     67.9     262.1    82.3      328.0
TransCrowd [18]  56.8     193.6    -         -
The dataset [25] includes 2,722 training images, 1,600 testing images, and 500 validation images collected from diverse scenarios and weather conditions. It contains negative samples (images without people), with counts ranging from 0 to 25,791. Because this dataset is quite large, few methods have used it for evaluation. TABLE III shows that TDCrowd outperforms the other methods on both the validation and testing sets.
E. Ablation study
Comparison with a Transformer-based crowd counting method: TransCrowd was one of the first methods to use the Transformer to estimate the number of people in a crowd. However, TransCrowd uses only count information and focuses on exploiting the attention mechanism. The experimental results in TABLE IV show that our method is better because it also exploits location information.
Comparison with methods using only count information: These methods use less information than density map-based methods, so we compare TDCrowd with them in the setting that uses only the count and no position information. TABLE IV shows that TDCrowd ranks second when evaluated on the ShanghaiTech Part A dataset. Although the MAE and MSE of TDCrowd are higher than those of TransCrowd using GAP, TDCrowd still improves on TransCrowd using Token by 2.1 MAE points and 8.2 MSE points.
TABLE IV. COMPARISON WITH METHODS THAT DO NOT USE LOCATION INFORMATION
V. CONCLUSION AND FUTURE WORK
We have proposed a model that uses the Transformer for density map construction. This model can capture global information to solve the problem of limited receptive fields. On the other hand, through the use of density maps, the model still takes advantage of the location information available in crowd datasets. Experimentally, when using the density map, our method is better than methods that use the Transformer only to estimate the count.
In the future, we will continue to improve the density map model for better results. We will also study counting methods for other objects such as animals, fruits, and books.
REFERENCES
[1] X. Cao et al., "Scale aggregation network for accurate and efficient crowd counting," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[2] N. Carion et al., "End-to-end object detection with transformers," in European Conference on Computer Vision, Springer, Cham, 2020.
[3] K. Chen, C. C. Loy, S. Gong, and T. Xiang, "Feature mining for localised crowd counting," in BMVC, 2012.
[4] P. T. Do and N. Q. Ly, "A new framework for crowded scene counting based on weighted sum of regressors and human classifier," in SoICT '18: Ninth International Symposium on Information and Communication Technology, 2018.
[5] P. T. Do, M. T. Phan, and T. T. C. Le, "A single-column convolutional neural networks for crowd counting," in 2019 6th NAFOSTED Conference on Information and Computer Science (NICS), IEEE, 2019.
[6] P. T. Do and N. Q. Ly, "A new high performance approach for crowd counting using human filter," in 2020 7th NAFOSTED Conference on Information and Computer Science (NICS), IEEE, 2020.
[7] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[8] D. Hendrycks and K. Gimpel, "Bridging nonlinearities and stochastic regularizers with Gaussian error linear units," 2016.
[9] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, "Multi-source multi-scale counting in extremely dense crowd images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2547-2554, 2013.
[10] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah, "Composition loss for counting, density map estimation and localization in dense crowds," in ECCV, pp. 532-546, 2018.
[11] W. Ge and R. T. Collins, "Marked point processes for crowd counting," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2913-2920, IEEE, 2009.
[12] X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. Doermann, and L. Shao, "Crowd counting and density estimation by trellis encoder-decoder network," in CVPR, 2019.
[13] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[14] Y. Lei et al., "Towards using count-level weak supervision for crowd counting," Pattern Recognition, vol. 109, 107616, 2021.
[15] V. Lempitsky and A. Zisserman, "Learning to count objects in images," in Advances in Neural Information Processing Systems, pp. 1324-1332, 2010.
[16] Y. Li, X. Zhang, and D. Chen, "CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes," in CVPR, 2018.
[17] D. Lian et al., "Density map regression guided detection network for RGB-D crowd counting and localization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[18] D. Liang et al., "TransCrowd: Weakly-supervised crowd counting with Transformer," arXiv preprint arXiv:2104.09116, 2021.
[19] N. Liu et al., "ADCrowdNet: An attention-injective deformable convolutional network for crowd understanding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[20] W. Liu, M. Salzmann, and P. Fua, "Context-aware crowd counting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[21] Y. Liu et al., "Point in, box out: Beyond counting persons in crowds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[22] N. Paragios and V. Ramesh, "A MRF-based approach for real-time subway monitoring," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, IEEE, 2001.
[23] D. B. Sam, S. Surya, and R. V. Babu, "Switching convolutional neural network for crowd counting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[24] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[25] V. Sindagi, R. Yasarla, and V. M. Patel, "JHU-Crowd++: Large-scale crowd counting dataset and a benchmark method," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017.
[27] J. Wan, Z. Liu, and A. B. Chan, "A generalized loss function for crowd counting and localization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[28] B. Wang et al., "Distribution matching for crowd counting," arXiv preprint arXiv:2009.13077, 2020.
[29] M. Wang and X. Wang, "Automatic adaptation of a generic pedestrian detector to a specific traffic scene," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3401-3408, IEEE, 2011.
[30] Y. Yang et al., "Weakly-supervised crowd counting learns from sorting rather than locations," in Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, Proceedings, Part VIII, Springer, 2020.
[31] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single image crowd counting via multi-column convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 589-597, 2016.