Attention in Crowd Counting Using the Transformer and Density Map to Improve Counting Result
Phuc Thinh Do
Dong Nai Technology University
Dong Nai, Vietnam
dophucthinh@dntu.edu.vn
Abstract— With the vigorous development of CNNs, most crowd counting methods estimate a density map with a CNN and then infer the count from it. However, these methods face many limitations, such as limited receptive fields and background noise. With the advent of the Transformer in natural language processing, it has become possible to apply this model to the crowd counting problem. The Transformer can model the global context, which helps solve the problem of limited receptive fields. On the other hand, with the attention mechanism, the model can focus on areas where people are concentrated, helping to suppress background noise. In this paper, we propose a crowd counting model combining a Transformer and a density map (TDCrowd) to estimate the number of people in a crowd. Thanks to the Transformer, TDCrowd can also be trained without information about the location of people in the crowd, using only the count. Experiments on three datasets, ShanghaiTech, UCF-QNRF, and JHU-Crowd++, show that TDCrowd gives better results than both regression-based methods (which need only the count) and density map-based methods (which need both the count and location information).
Keywords— crowd counting, convolutional neural networks,
density map, Transformer, attention
I. INTRODUCTION
Crowd counting refers to estimating the number of objects in a crowd, such as people, vehicles, or trees. It is one of the essential tasks in surveillance systems. The original approach to crowd counting was to detect objects and count the number of detections. Since object detection performs poorly in images with dense crowds, a new family of methods arose: regression methods. Methods of this type attempt to learn a mapping between the crowd image and the count. Recently, with the development of deep learning, crowd counting has moved in a new direction, using density maps (Fig. 1). This approach takes advantage of spatial information by exploiting the positions of objects in the image. However, density map-based methods face problems such as limited receptive fields and background noise. With the advent of the Transformer [26] in natural language processing, many works have adapted this model to image processing [2], [7]. The advantage of the Transformer is that it can capture global information, which addresses the limited receptive fields of CNNs in general and of density map-based methods in particular. In this paper, we propose a combined model of Transformer and density map (TDCrowd). This model exploits the attention mechanism to focus on crowded areas, thereby reducing background noise. On the other hand, with its ability to gather global information, the model overcomes the problem of limited receptive fields.
In summary, we propose a model that uses the Transformer to generate density maps instead of only estimating counts. This approach allows our model to capture the global context, thus solving the problem of limited receptive fields of CNNs. On the other hand, we can leverage location information to improve the density map, because a density map can be generated from the head positions. The rest of the paper is organized as follows. In Section II, we discuss crowd counting approaches and related research. Next, we present the proposed model, the baseline, and how to train the model. In Section IV, we describe the experiments and analyze the results. Finally, we conclude and outline future directions in the last section.
Fig. 1. Sample images from the JHU-Crowd++ dataset and their density maps.
II. RELATED WORK
Previous crowd counting methods often follow the direct counting approach or use regression models. Direct counting methods use a sliding window over the image to detect objects [11], [29]. Several works use CNNs to build a detection model that predicts bounding boxes [17], [21]; the number of bounding boxes is then the number of people in the image. However, in scenes with too many people, these models have difficulty identifying individual objects. Another approach is to use a regression model [3], [9], [22]. Methods of this type build a regression model mapping the crowd image to the count. However, this makes the model's results harder to interpret. Moreover, since only a single count is used, the model lacks spatial information. Lempitsky [15] proposed mapping the input image to a density map that can be integrated to obtain the final count. Density maps take advantage of spatial information, namely the locations of people in the image. With the development of CNNs, many methods using CNNs to generate density maps have been proposed. MCNN [31] uses multi-column CNNs with different filter sizes to extract features at different scales. Switch-CNN [23] improves MCNN by using an additional VGG-16 classifier to select the appropriate CNN column, while Do et al. [4], [5] focus on removing non-human scenes. CSRNet [16] uses dilated convolutions to increase receptive fields. CHF [6] uses filters to improve density map quality.
Recently, many works have used the Transformer for image processing and achieved good results. DETR [2] and ViT [7] are among the first works to apply the Transformer to object detection and recognition. Inspired by ViT, Liang et al. [18] proposed TransCrowd, one of the first methods to use Transformers for crowd counting. The input image is converted to sequence data and mapped to a count. This method can collect global information but ignores information about the locations of people in the image.
III. PROPOSED METHOD
Our proposed model uses the Transformer [26] to build the density map. The sum of the pixel values of the density map represents the number of people. Inspired by ViT [7], our model uses the Transformer encoder to extract information. However, instead of mapping directly to the count, the output of the Transformer encoder is convolved with a 1×1 convolution to generate a density map. This method helps solve the background noise problem because the attention mechanism focuses attention on areas where people are present. Furthermore, the Transformer can receive global information, which helps solve the problem of limited receptive fields. The model can be trained in two ways: with position information or with only count information. If the data contain information about the locations of people in the image, the model is trained using the L2 loss between the estimated density map and the ground truth density map. On the other hand, if only the count information is available, the model computes the count from the density map estimated after the 1×1 convolution [31] and uses the L1 loss between the predicted count and the ground truth count. The proposed model is depicted in Fig. 2.
A. Input image processing
Because the input of the Transformer [26] is sequence data, we convert the input image into equal parts. Given an input image of size $h \times w \times 3$ (where $w$ is the width and $h$ is the height), we divide it into patches of size $p \times p \times 3$. The number of patches obtained is $N = hw/p^2$. We then stack these patches into a sequence $x \in \mathbb{R}^{N \times (p^2 \cdot 3)}$ (Fig. 2). To convert the sequence into a latent $D$-dimensional embedding, we use a learnable projection $f: x \rightarrow \{e_1, e_2, \ldots, e_N\}$, where each $e_i$ has size $D$. To maintain the position information, we add a learnable position embedding $\{pos_1, pos_2, \ldots, pos_N\}$.
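To make the patching step concrete, the following is a minimal PyTorch sketch of the patch embedding described above (the image size, patch size p = 16, and embedding dimension D = 768 are illustrative assumptions, not values reported here; a strided convolution implements the flatten-and-project operation):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Splits an h x w x 3 image into p x p patches, projects each patch
    # to a D-dimensional embedding, and adds position embeddings.
    def __init__(self, img_size=384, patch_size=16, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2      # N = hw / p^2
        # Equivalent to flattening each p x p x 3 patch and applying
        # the learnable projection f.
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                              stride=patch_size)
        # Learnable position embeddings pos_1, ..., pos_N.
        self.pos = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                  # x: (B, 3, h, w)
        x = self.proj(x)                   # (B, D, h/p, w/p)
        x = x.flatten(2).transpose(1, 2)   # (B, N, D)
        return x + self.pos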
B. Transformer encoder
The Transformer encoder [26] consists of $L$ blocks, each containing a multi-head self-attention (MSA) module and a multilayer perceptron (MLP). In every block, layer normalization (LN) and residual connections are applied. The MLP contains two layers with a GELU activation function [8]: the first layer expands the embedding dimension from $D$ to $4D$, and the second compresses it from $4D$ back to $D$. The output $z_\ell$ of the Transformer encoder is computed as follows:

$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1, 2, \ldots, L$

$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1, 2, \ldots, L$

MSA consists of $m$ independent self-attention (SA) modules and a re-projection operation. The input of each SA module contains three components, query ($Q$), key ($K$), and value ($V$):

$Q = z_{\ell-1} W_Q, \quad K = z_{\ell-1} W_K, \quad V = z_{\ell-1} W_V$

where $W_Q$, $W_K$, $W_V$ are three learnable matrices. The output of SA applies the softmax function to the scaled query-key products:

$\mathrm{SA}(z_{\ell-1}) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d}\right)V$

Different from ViT [7], the desired output of the model is a density map. Therefore, we add a $1 \times 1$ convolution layer to regress the density map (Fig. 2).
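As an illustration of the encoder equations and the density regression head, here is a minimal PyTorch sketch (D = 768 and m = 12 heads are illustrative assumptions; torch.nn.MultiheadAttention stands in for the MSA module with its W_Q, W_K, W_V projections and re-projection):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # One encoder block: z' = MSA(LN(z)) + z, then z = MLP(LN(z')) + z'.
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(          # expands D -> 4D, then 4D -> D
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]
        return z + self.mlp(self.ln2(z))

class DensityHead(nn.Module):
    # Reshapes the token sequence back to a 2-D grid and applies the
    # 1 x 1 convolution that regresses the density map.
    def __init__(self, dim=768):
        super().__init__()
        self.conv = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, z, gh, gw):          # z: (B, N, D), N = gh * gw
        x = z.transpose(1, 2).reshape(z.size(0), -1, gh, gw)
        return self.conv(x)                # (B, 1, gh, gw) density map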
Fig. 2. The proposed model in the training stage.
Fig. 3. The proposed model in the testing stage.
C. The ground truth density map
For datasets with location information, we create the ground truth density map needed for training by applying a Gaussian kernel at each head position:

$D^{gt}(x) = \sum_{i=1}^{n} \delta(x - x_i) * G_\sigma(x)$

where $D^{gt}$ is the ground truth density map, $G_\sigma$ is a Gaussian kernel with standard deviation $\sigma$, $x_i$ is the $i$-th head position, and $n$ is the number of head positions. The final count is obtained by summing the values of the density map. Similar to previous methods using density maps, we choose $\sigma = 15$.
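A ground-truth density map of this form can be generated with a short NumPy/SciPy routine (a sketch under the fixed-sigma assumption above; gaussian_filter applies the kernel G_sigma, and since the filter preserves total mass, the map still sums to n):

import numpy as np
from scipy.ndimage import gaussian_filter

def make_gt_density_map(head_points, h, w, sigma=15):
    # Place a unit impulse at each annotated head position (x, y),
    # then convolve with a Gaussian of standard deviation sigma.
    dm = np.zeros((h, w), dtype=np.float32)
    for x, y in head_points:
        row = min(max(int(y), 0), h - 1)
        col = min(max(int(x), 0), w - 1)
        dm[row, col] += 1.0
    return gaussian_filter(dm, sigma)      # dm.sum() stays close to n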
D. Loss function
After obtaining the density map via the $1 \times 1$ convolution, the model is trained using the L2 loss function, which measures the difference between the estimated density map and the ground truth density map:

$L_2(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| D_i(\Theta) - D_i^{gt} \right\|_2^2$

where $N$ is the number of images, $D_i^{gt}$ is the ground truth density map of the $i$-th image, and $D_i(\Theta)$ is the estimated density map of the $i$-th image with parameters $\Theta$.
However, thanks to the use of the Transformer [26], the proposed model can also be trained using the L1 loss function, which measures the difference between the predicted count and the ground truth count:

$L_1(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left| C_i(\Theta) - C_i^{gt} \right|$

where $N$ is the number of images, $C_i^{gt}$ is the ground truth count of the $i$-th image, and $C_i(\Theta)$ is the predicted count of the $i$-th image with parameters $\Theta$.
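Both objectives map directly onto PyTorch's built-in losses, as in the following sketch (PyTorch's mean-reduced losses differ from the formulas above only by constant scaling, which can be folded into the learning rate):

import torch
import torch.nn.functional as F

def counting_loss(est_dm, target, has_locations):
    # est_dm: (B, 1, H, W) estimated density map.
    # target: ground-truth density map of the same shape, or a (B,)
    # tensor of ground-truth counts when only counts are annotated.
    if has_locations:
        return F.mse_loss(est_dm, target)   # L2 objective (up to scale)
    counts = est_dm.sum(dim=(1, 2, 3))      # integrate the map -> count
    return F.l1_loss(counts, target)        # L1 objective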
The details of the algorithm used to build and train the model are shown in Fig. 4.
Algorithm: Training
Input: Annotated images with ground truth density map dm (flag = 1) or ground truth count gtc (flag = 0)
Output: Trained model

Begin
    optimizer = Adam(lr=1e-5)
    if flag == 1:  # with the position information
        model = Sequential(VisionTransformer(), Conv2d(kernel_size=1))
    if flag == 0:  # with the count information only
        model = VisionTransformer()
    for image in dataset:
        if flag == 1:
            est = model(image)            # estimated density map
            l2loss = MSELoss(est, dm)
            optimizer.zero_grad()         # optimized using the PyTorch library
            l2loss.backward()
            optimizer.step()
        if flag == 0:
            est = model(image)
            count = est.sum()             # calculate the count
            l1loss = L1Loss(count, gtc)
            optimizer.zero_grad()
            l1loss.backward()
            optimizer.step()
End
Fig. 4. Model training algorithm.
IV. EXPERIMENTS
We evaluate our method on three datasets: ShanghaiTech, UCF-QNRF, and JHU-Crowd++. The model is implemented in Python using the PyTorch library. When processing the input image, we choose n = 3 [4]. To increase the training data, we use transformations such as rotation, flipping, and random cropping. Due to memory limitations, we limit the image size to 1024 and use sliding windows when dealing with the UCF-QNRF and JHU-Crowd++ datasets. We use the Adam optimizer [13] with a learning rate of 1e-5 and a weight decay of 1e-4. The evaluation metric and experimental results are shown below.
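The sliding-window inference on oversized images could be sketched as follows (a sketch only, assuming non-overlapping windows and image sides that are multiples of the window size; the paper does not specify the exact scheme):

import torch

@torch.no_grad()
def sliding_window_count(model, image, win=1024):
    # image: (3, h, w) tensor; the count of each window's density map
    # is accumulated, since a map's sum equals its window's count.
    _, h, w = image.shape
    total = 0.0
    for top in range(0, h, win):
        for left in range(0, w, win):
            patch = image[:, top:top + win, left:left + win].unsqueeze(0)
            total += model(patch).sum().item()
    return total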
A. Evaluation metric
For comparison with previous methods, we use two evaluation metrics: Mean Absolute Error (MAE) and Mean Squared Error (MSE):

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{gt} \right|$

$\mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{gt} \right)^2}$

where $N$ is the number of images, $C_i$ is the estimated count, and $C_i^{gt}$ is the ground truth count. MAE indicates the accuracy of the predicted result, and MSE measures the robustness.
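Evaluated over a test set, these two metrics can be computed as follows (a small NumPy sketch; note that, following the crowd counting convention used above, MSE denotes the root of the mean squared error):

import numpy as np

def evaluate(pred_counts, gt_counts):
    # pred_counts, gt_counts: sequences of per-image counts.
    pred = np.asarray(pred_counts, dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    mae = np.abs(pred - gt).mean()                 # accuracy
    mse = np.sqrt(((pred - gt) ** 2).mean())       # robustness
    return mae, mse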
B. ShanghaiTech dataset
This dataset [31] is divided into Part A and Part B. Part A includes 300 training images and 182 testing images crawled from the internet. Part B consists of 400 training images and 316 testing images; images in Part B are taken in the metropolis of Shanghai. In total, the dataset includes 1,198 images with 330,165 annotations. As TABLE I shows, TDCrowd gives better results than other methods, especially for images with a high density of people.
TABLE I. COMPARISON WITH OTHER METHODS ON THE SHANGHAITECH DATASET

Method            Part A MAE  Part A MSE  Part B MAE  Part B MSE
Switch-CNN [23]   90.4        135.0       21.6        33.4
Do et al. [4]     81.9        122.1       20.9        33.1
TransCrowd [18]   66.1        105.1       9.3         16.1
Fig. 5. Visualization results of our model on the ShanghaiTech Part A dataset. The left column is the sample image, and the right column is the estimated density map.
Fig. 6. Visualization results of our model on the ShanghaiTech Part B dataset. The left column is the sample image, and the right column is the estimated density map.
C. UCF-QNRF dataset
This dataset [10] includes 1,535 images with 1.25 million annotations, split into 1,201 training images and 334 testing images. The dataset is intended for crowd counting and localization and contains realistic scenarios captured in the wild. The number of people per image ranges from 49 to 12,865. On this large-scale dataset, TDCrowd outperforms the state-of-the-art methods: it reduces the MAE of TransCrowd from 97.2 to 83.0 and the MSE from 168.5 to 143.4 (TABLE II).
TABLE II. COMPARISON WITH OTHER METHODS ON THE UCF-QNRF DATASET

Method              MAE    MSE
Idrees et al. [10]  132.0  191.0
TransCrowd [18]     97.2   168.5
TDCrowd (ours)      83.0   143.4
Fig. 7. Visualization results of our model on the UCF-QNRF dataset. The left column is the sample image, and the right column is the estimated density map.
D. JHU-Crowd++ dataset
Fig. 8. Visualization results of our model on the JHU-Crowd++ dataset. The left column is the sample image, and the right column is the estimated density map.
TABLE III. COMPARISON WITH OTHER METHODS ON THE JHU-CROWD++ DATASET

Method           Val MAE  Val MSE  Test MAE  Test MSE
MCNN [31]        160.6    377.7    188.9     483.4
CSRNet [16]      72.2     249.9    85.9      309.2
CAN [20]         89.5     239.3    100.1     314.0
SANet [1]        82.1     272.6    91.1      320.4
CG-DRCN [25]     67.9     262.1    82.3      328.0
TransCrowd [18]  56.8     193.6    -         -
The dataset [25] includes 2,722 training images, 1,600 testing images, and 500 validation images collected from diverse scenarios and weather conditions. It contains negative samples (images without people), with counts ranging from 0 to 25,791. Because this dataset is quite large, few methods have used it for evaluation. TABLE III shows that TDCrowd outperforms the other methods on both the validation and testing sets.
E. Ablation study
Comparison with a Transformer-based crowd counting method: TransCrowd was one of the first methods to use the Transformer to estimate the number of people in a crowd. However, TransCrowd uses only count information and focuses on exploiting the attention mechanism. The experimental results in TABLE IV show that our method is better because it also exploits location information.
Comparison with methods using only count information: These methods use less information than density map-based methods, so we compare TDCrowd with them in the setting that uses only the count and no position information. TABLE IV shows that TDCrowd ranks second when evaluated on the ShanghaiTech Part A dataset. Although the MAE and MSE of TDCrowd are higher than those of TransCrowd using GAP, TDCrowd still improves on TransCrowd using Token by 2.1 MAE points and 8.2 MSE points.
TABLE IV. COMPARISON WITH METHODS THAT DO NOT USE LOCATION INFORMATION
V. CONCLUSION AND FUTURE WORK
We have proposed a model that uses the Transformer for density map construction. This model can capture global information to solve the problem of limited receptive fields. On the other hand, through the use of density maps, the model still takes advantage of the location information available in crowd datasets. Experimentally, when using the density map, our method is better than methods that use the Transformer only to estimate the count.
In the future, we will continue to improve the density map model for better results. We will also study counting methods for other objects such as animals, fruits, and books.
REFERENCES
[1] X. Cao et al., "Scale aggregation network for accurate and efficient crowd counting," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[2] N. Carion et al., "End-to-end object detection with transformers," in European Conference on Computer Vision, Springer, Cham, 2020.
[3] K. Chen, C. C. Loy, S. Gong, and T. Xiang, "Feature mining for localised crowd counting," in BMVC, 2012.
[4] P. T. Do and N. Q. Ly, "A new framework for crowded scene counting based on weighted sum of regressors and human classifier," in SoICT '18: Ninth International Symposium on Information and Communication Technology, 2018.
[5] P. T. Do, M. T. Phan, and T. T. C. Le, "A single-column convolutional neural networks for crowd counting," in 2019 6th NAFOSTED Conference on Information and Computer Science (NICS), IEEE, 2019.
[6] P. T. Do and N. Q. Ly, "A new high performance approach for crowd counting using human filter," in 2020 7th NAFOSTED Conference on Information and Computer Science (NICS), IEEE, 2020.
[7] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[8] D. Hendrycks and K. Gimpel, "Bridging nonlinearities and stochastic regularizers with Gaussian error linear units," 2016.
[9] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, "Multi-source multi-scale counting in extremely dense crowd images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2547-2554, 2013.
[10] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah, "Composition loss for counting, density map estimation and localization in dense crowds," in ECCV, pp. 532-546, 2018.
[11] W. Ge and R. T. Collins, "Marked point processes for crowd counting," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2913-2920, IEEE, 2009.
[12] X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. Doermann, and L. Shao, "Crowd counting and density estimation by trellis encoder-decoder network," in CVPR, 2019.
[13] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[14] Y. Lei et al., "Towards using count-level weak supervision for crowd counting," Pattern Recognition, vol. 109, 107616, 2021.
[15] V. Lempitsky and A. Zisserman, "Learning to count objects in images," in Advances in Neural Information Processing Systems, pp. 1324-1332, 2010.
[16] Y. Li, X. Zhang, and D. Chen, "CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes," in CVPR, 2018.
[17] D. Lian et al., "Density map regression guided detection network for RGB-D crowd counting and localization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[18] D. Liang et al., "TransCrowd: Weakly-supervised crowd counting with Transformer," arXiv preprint arXiv:2104.09116, 2021.
[19] N. Liu et al., "ADCrowdNet: An attention-injective deformable convolutional network for crowd understanding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[20] W. Liu, M. Salzmann, and P. Fua, "Context-aware crowd counting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[21] Y. Liu et al., "Point in, box out: Beyond counting persons in crowds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[22] N. Paragios and V. Ramesh, "A MRF-based approach for real-time subway monitoring," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, IEEE, 2001.
[23] D. B. Sam, S. Surya, and R. V. Babu, "Switching convolutional neural network for crowd counting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[24] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[25] V. Sindagi, R. Yasarla, and V. M. Patel, "JHU-Crowd++: Large-scale crowd counting dataset and a benchmark method," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017.
[27] J. Wan, Z. Liu, and A. B. Chan, "A generalized loss function for crowd counting and localization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[28] B. Wang et al., "Distribution matching for crowd counting," arXiv preprint arXiv:2009.13077, 2020.
[29] M. Wang and X. Wang, "Automatic adaptation of a generic pedestrian detector to a specific traffic scene," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3401-3408, IEEE, 2011.
[30] Y. Yang et al., "Weakly-supervised crowd counting learns from sorting rather than locations," in Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, Proceedings, Part VIII, Springer, 2020.
[31] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single image crowd counting via multi-column convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 589-597, 2016.