
VNAnomaly: A novel Vietnam surveillance video dataset for anomaly detection

Tu N. Vu
Vietnam National University
University of Information Technology
Ho Chi Minh City, Vietnam
18520184@gm.uit.edu.vn

Toan T. Dinh
Vietnam National University
University of Information Technology
Ho Chi Minh City, Vietnam
18521504@uit.edu.vn

Nguyen D. Vo
Ho Chi Minh City, Vietnam
nguyenvd@uit.edu.vn

Tung Minh Tran
Ho Chi Minh City, Vietnam
tungtm.ncs@grad.uit.edu.vn

Khang Nguyen
Ho Chi Minh City, Vietnam
khangnttm@uit.edu.vn

Abstract—Surveillance systems have long been considered an effective tool for capturing realistic abnormal actions or events in domains such as traffic management and security. With the development of smart cities, thousands of installed surveillance cameras play a vital role in detecting and preventing dangerous events. However, there is a lack of anomaly datasets for developing automatic anomaly detection systems in Vietnam. In this study, we introduce a new dataset named VNAnomaly for anomaly detection in Vietnam. Moreover, we conduct a thorough evaluation of current state-of-the-art unsupervised anomaly detection methods based on deep architectures, including MLEP, Future frame prediction, MNAD, and MNAD with modified inference, on benchmark datasets and on our dataset. Experimental results indicate that the proposed method almost always outperforms the competitors and achieves the best performance, with an Area Under the Curve (AUC) score of 61.14%.

Index Terms—Anomaly, Anomaly Detection, Deep Learning, VNAnomaly, Autoencoder.

I. INTRODUCTION

Nowadays, with advances in artificial intelligence, integrating surveillance cameras has emerged as an efficient tool for complicated urban management tasks such as road traffic monitoring or anomalous event detection. An abnormal event in a surveillance camera is defined as an event that does not conform to expected behavior [1], [2]. The anomaly detection problem takes a sequence of frames as input and returns a label (Normal or Abnormal) for each frame; see Figure 1. With the development of smart cities in Vietnam, it is reasonable to build a surveillance system that can identify abnormal events such as crimes or illegal activities. However, few studies provide a decent data resource for anomaly detection in Vietnam. Therefore, this study provides a novel dataset that focuses on human-related events to support urban management.

One of the biggest challenges of anomaly detection is the ambiguity of the definition of an anomaly. Identifying an abnormal event depends not only on the activities and appearance of the objects but also on the context of the surveillance video [2], [3]. Some events are normal in one context but abnormal in another. For example, riding a motorbike in a pedestrian zone is considered an anomaly, but on a city road it is a normal event [2]. To avoid this ambiguity, this work focuses on urban street scenes in Vietnam and on unusual events that often occur in this context.

Fig. 1. The anomaly detection model takes a video (a sequence of frames) as input and returns the label of each frame in the video: normal or abnormal. Sample images are taken from the VNAnomaly dataset.

Currently, there are two main approaches to anomaly detection, unsupervised learning and weakly-supervised learning, which are distinguished by the experimental setting of the training data [3]. One of the most important challenges of anomaly detection is the scarcity of anomalous events, which leads to imbalanced datasets. Abnormal samples are usually expensive to collect, and new, previously unseen kinds of anomaly can always appear. In the unsupervised learning approach, models are trained with only normal video frames and validated with both normal and anomalous frames. In weakly-supervised learning approaches, on the other hand, models are trained with normal data and only a very small amount of anomalous data. Many benchmark datasets are provided for both approaches. However, most of the unsupervised datasets are single-scene, so they are not close to real-world scenes. We summarize our contributions as follows:

• We introduce a novel dataset for the task of unsupervised anomaly detection on streets in Vietnam.

• We conduct a thorough evaluation of current state-of-the-art methods for unsupervised anomaly detection on the dataset.

• We suggest a way to modify the inference stage of unsupervised approaches, which increases the MNAD method's result by about 0.5%-2%, compared to the previous state-of-the-art methods.

The rest of the paper is organized as follows. In Section II, we summarize the related works. Section III describes the collection process, the annotation process, and detailed information about our dataset. In Section IV, we discuss the evaluation methods and our proposed modification. In Section V, the evaluation and the outcomes obtained from different detection methods are presented. The paper ends with a conclusion and some directions for future work.

II. RELATED WORKS

A. Anomaly detection

Anomaly detection is a binary classification between the normal and the anomalous classes, and it is one of the most challenging and long-standing problems in computer vision [4]. For video surveillance applications, there have been many attempts to detect abnormalities as well as violence in videos. Overall, there are two main approaches to this problem: (1) unsupervised learning; and (2) weakly-supervised learning.

1) Unsupervised learning: In contrast to the abundance of normal events, abnormal events occur with very low probability. Furthermore, it is almost infeasible to gather all kinds of abnormal events. Therefore, in unsupervised learning methods [1], [5], [6], models are trained with only normal video frames, because such data are readily available in benchmark datasets. The collected anomalous frames are used only for validation purposes. These methods focus on learning the pattern of normal frames. At inference time, they try to reconstruct or predict the current frame and use the reconstruction or prediction error to calculate the anomaly score that determines whether a frame is anomalous; see Figure 2.

Fig. 2. An example of the inference process of an unsupervised method: Future frame prediction [1]. The pipeline consists of a reconstructing module followed by a comparing module that outputs the anomaly score.
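The step from prediction error to anomaly score can be illustrated with a minimal sketch. It assumes a PSNR-based error that is min-max normalised over one video, a choice that is common for frame-prediction methods; the text above only states that the reconstruction or prediction error is used, so the function names `psnr` and `regularity_scores` and the toy frames are illustrative assumptions, not the authors' code.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between a predicted and a real frame."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = max(float(np.mean(diff ** 2)), 1e-12)  # guard against log(0)
    return 10.0 * np.log10(max_val ** 2 / mse)

def regularity_scores(psnr_values):
    """Min-max normalise per-frame PSNR over one video to [0, 1].

    High values mean the frame was predicted well (likely normal);
    low values mean a large prediction error (likely anomalous).
    """
    p = np.asarray(psnr_values, dtype=np.float64)
    span = p.max() - p.min()
    if span == 0:
        return np.ones_like(p)
    return (p - p.min()) / span

# Toy data (assumption): five 8x8 frames with pixel values in [0, 1].
rng = np.random.default_rng(0)
real = rng.random((5, 8, 8))
pred = real + 0.01 * rng.standard_normal(real.shape)  # good predictions
pred[3] = rng.random((8, 8))                          # one badly predicted frame

scores = regularity_scores([psnr(p, r) for p, r in zip(pred, real)])
print(int(np.argmin(scores)))  # index of the least regular frame
```

In this toy run the badly predicted frame receives the lowest regularity score, which is the behaviour a threshold on the score would rely on.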

2) Weakly-supervised learning: For weakly-supervised learning approaches, the anomalous datasets are mainly collected from social media platforms such as YouTube and Facebook [3]. The diversity and enormous video volume of these platforms allow researchers to access and collect a large number of anomaly videos. In these approaches [4], [7], abnormal events are explicitly predefined and collected in various contexts from numerous sources. Moreover, models are trained with normal data and only a very small amount of anomalous data to learn to distinguish between normal and abnormal events. These approaches usually comprise three main modules, shown in Figure 3: i) arranging training instances, which preprocesses the video-level ground truth; ii) feature extraction, which extracts the video's features; and iii) a fully connected network, which classifies whether a frame is anomalous.

Fig. 3. An example of a weakly-supervised method from Sultani et al. [4]: arrange training instances, feature extraction, fully connected module, anomaly score.

B. Existing anomaly datasets

Many studies have proposed datasets for anomaly detection in recent years. These datasets can be separated into two types: single-scene and multi-scene datasets [2]. We discuss each type below and summarize them in Table I.

1) Single-scene datasets: Single-scene datasets contain only a limited number of scenes, usually fewer than three. Because collecting surveillance camera videos used to be difficult, it was reasonable to take one long video captured by a single camera. Therefore, many single-scene datasets have been introduced. However, they might not be general enough to satisfy real-world surveillance applications. Samples from some popular single-scene datasets are displayed in Figure 4.

• Subway [9]: covers the entrance gates and the exit gates in a subway station and comprises one video each. The entrance gate video sequence is 1 hour 36 minutes long, whereas the exit gate video footage is 43 minutes long, with a resolution of 384 x 512. Anomalous activities mainly include people jumping or trying to get through the turnstiles without payment and walking in the wrong direction.

• UCSD Pedestrian [8]: consists of two subsets, UCSD Pedestrian 1 (Ped 1) and UCSD Pedestrian 2 (Ped 2). Ped 1 contains 34 training videos and 36 evaluating videos with 40 anomalous events. Most of the outliers in this dataset are related to cyclists and car drivers entering the pedestrian zone. Ped 2 consists of 16 training videos and 12 evaluating videos with 12 anomalous events. The definition of an anomaly in Ped 2 is similar to the one in Ped 1. The main differences between these two subsets are the viewpoints, the dataset sizes, and the resolutions (158 x 238 in Ped 1 and 240 x 360 in Ped 2). Both subsets contain only one scene each.

• CUHK Avenue [10]: contains training videos and 21 evaluating videos (resolution 480 x 856 pixels) with 47 anomalous events in total, including throwing an object, running, and jumping. This dataset captures only one scene, but the size of people may change because of the distance and angle of the camera.

TABLE I
CHARACTERISTICS OF VIDEO ANOMALY DETECTION DATASETS FOR UNSUPERVISED APPROACHES

Fig. 4. Samples of normal and abnormal frames in the single-scene datasets. Red boxes denote anomalies in abnormal frames. Panels: UCSD Ped 1 (ride a bike), UCSD Ped 2 (ride a bike), Subway Entrance (enter without payment), CUHK Avenue (throwing object).

2) Multi-scene datasets: In recent years, the popularity of surveillance cameras and the rise of video-sharing platforms have enabled an increase in the number of scenes in anomaly datasets. Datasets used in recent studies include the ShanghaiTech dataset and the UCF-Crime dataset. Some samples are shown in Figure 5.

Fig. 5. Samples of normal and abnormal frames in the multi-scene datasets. Red boxes denote anomalies in abnormal frames. Panel labels: chasing, shoplifting.

• ShanghaiTech [1]: contains training videos and 107 evaluating videos (resolution 480 x 856 pixels) captured on a campus. Presenting mostly person-based anomalies, it contains 130 abnormal events captured in 13 different scenes with complex lighting conditions and camera angles. Some of its anomalous events, such as riding a bike or skateboarding, do not relate to security purposes in general.

• UCF-Crime [4]: a dataset that spans over 128 hours of videos of 240 × 320 resolution and contains 13 different classes of real-world anomalies. Its training split contains 800 normal and 810 anomalous videos, while the test split has 150 normal and 140 anomalous videos. This dataset is intended for a very different formulation of video anomaly detection, referred to as weakly-supervised anomaly detection.

III. VNANOMALY DATASET

A. Dataset description

To tackle the limitations of existing datasets, we present the VNAnomaly dataset. It consists of 90 training videos and 127 evaluating videos that include real-world anomalies on streets in Vietnam. The dataset covers 4 common human-related anomaly types on the streets of Vietnam: fighting, assault, vandalism, and robbery. In Figure 6, we show normal and abnormal frames with a nearly similar scene. Because VNAnomaly is an unsupervised dataset, we do not explicitly define these anomaly types. We chose these anomaly types because they are more common than other types. In addition, they are relevant to the safety of public lives and assets in Vietnam's urban environments. However, some unusual events, such as traffic accidents, are not yet covered. Therefore, we will continue to provide more anomaly types soon.

Fig. 6. Normal and abnormal frames from the VNAnomaly dataset. Panel labels: normal (ride a bike), abnormal (fighting).

Our dataset surpasses existing unsupervised datasets in the following three aspects.

1) To the best of our knowledge, this is one of the first unsupervised datasets that capture Vietnamese scenes.

2) Larger data volume: VNAnomaly has 578,609 training frames and 75,214 evaluating frames, which is larger than existing unsupervised benchmark datasets.

3) Higher scene diversity: Our dataset contains 36 scenes that differ in aspects such as camera angle and time of day. Furthermore, to ensure that the context the model learns in the training stage matches the testing phase, the scenes in the training set are required to correspond to those in the testing set. Compared with the ShanghaiTech dataset [1], which has 13 scenes, our dataset has nearly three times as many scenes. Some different scenes are shown in Figure 7.

Fig. 7. The diversity of scenes in the VNAnomaly dataset. Panel labels: fighting, robbery, fighting.

B. Collecting process

To enhance the dataset's quality, we constrain the context of this dataset to streets in Vietnam captured by surveillance cameras. We search for videos on YouTube using text queries in Vietnamese. In addition, to ensure that the videos collected from YouTube are surveillance videos, we closely inspect the context of each video and only choose videos that are not related to commercial purposes and are captured by surveillance cameras. After that, following the collecting process in [4], we discard videos that fall into any of the following categories: manually edited, prank videos, not captured by CCTV cameras, taken from news, captured using a hand-held camera, or containing compilations. Additionally, we try to collect training videos that have some similarities with the videos in the testing set. The training sequences last seven and a half hours, whereas the testing sequences last approximately 1 hour with 110 anomalous events. In Figure 6, we show normal and abnormal frames with a nearly similar scene.

More specifically, our dataset contains street scenes in three surveillance video types related to time: daytime, full-color nighttime, and black-and-white nighttime videos. In total, our dataset has 5 hours 19 minutes 59 seconds of daytime videos and 3 hours 49 seconds of nighttime videos. The duration of each surveillance video type in the training and testing sets is shown in Figure 8. Moreover, the street scenes are captured from 2 different camera angles: left angle (3 hours 39 minutes and 53 seconds) and right angle (4 hours 19 minutes and 52 seconds). The diversity of scenes, view angles, and light conditions reflects the real-world environment, which also poses a challenge for anomaly detection.

Fig. 8. The duration of the different surveillance video types: (a) training set; (b) testing set.

C. Annotation

Ground truths for each testing video are annotated in temporal form. We divide the VNAnomaly dataset into separate parts for the annotators. The annotators then cross-check each other's labels, and we re-check the labels ourselves to catch mistakes. Our annotation format follows that of [1], [4], i.e., the start and end frames of each anomalous event in each testing video.
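Temporal annotations of this kind are typically expanded into one label per frame before scoring. The sketch below shows one way to do this; the function name `intervals_to_labels`, the 0-based inclusive indexing, and the example intervals are assumptions made for illustration and are not taken from the dataset files.

```python
import numpy as np

def intervals_to_labels(num_frames, events):
    """Expand (start, end) event intervals into per-frame labels.

    events: list of (start, end) frame indices, 0-based and inclusive
    (an assumed convention). Returns an array with 1 = abnormal, 0 = normal.
    """
    labels = np.zeros(num_frames, dtype=np.int8)
    for start, end in events:
        if not 0 <= start <= end < num_frames:
            raise ValueError(f"invalid interval ({start}, {end})")
        labels[start:end + 1] = 1  # end frame is part of the event
    return labels

# Hypothetical 10-frame video with two anomalous events.
labels = intervals_to_labels(10, [(2, 4), (7, 8)])
print(labels.tolist())  # [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
```

Storing only start and end frames keeps annotation cheap, while the expanded array is the form needed for frame-level evaluation.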

IV. EVALUATION

A. Methods

In this study, we evaluate three state-of-the-art unsupervised anomaly detection methods: Future frame prediction [1], Margin learning embedded prediction (MLEP) [5], and Learning Memory-guided Normality for Anomaly Detection (MNAD) [6].

• Future frame prediction [1]: one of the baselines for anomaly detection problems. It uses a generative adversarial network (GAN) [11] to exploit normal patterns and predict the next frame. This method focuses on predicting the future frame and uses the prediction error to calculate the anomaly score. An illustration of this process is given in Figure 2.

• MLEP [5]: a method based on ConvLSTM [12]. In this method, the feature corresponding to the hidden state of the last input in the ConvLSTM is fed to the margin learning module. This module learns a more compact normal data distribution and enlarges the margin between normal and abnormal events, which improves the model's ability to discriminate between normal and abnormal frames.

• MNAD [6]: one of the state-of-the-art unsupervised methods. This model adds a new memory module that records prototypical patterns of normal data in the memory items. Moreover, feature compactness and separateness losses are proposed to train the memory module.

B. Measurement

Evaluation scores in most works on anomaly detection [1], [5], [6] are calculated from the receiver operating characteristic (ROC) curve. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The final score is the area under the ROC curve (AUC), which measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1). The higher the AUC score, the better the model distinguishes between normal and anomalous frames. The AUC scores of the experimental results are visualized in Figure 9.

Fig. 9. AUC score visualization of state-of-the-art methods on VNAnomaly.
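The frame-level AUC described in this subsection can be computed as in the following sketch, which sweeps the threshold over the sorted scores and integrates TPR against FPR with the trapezoidal rule. The function name `roc_auc` and the convention that a higher score means more anomalous (label 1 = abnormal) are assumptions chosen for the example.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = abnormal).

    Assumes higher scores indicate more anomalous frames.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=np.float64)
    if labels.all() or not labels.any():
        raise ValueError("both classes must be present")

    # Sort frames from most to least anomalous.
    order = np.argsort(-scores, kind="stable")
    labels, scores = labels[order], scores[order]

    tp = np.cumsum(labels)    # true positives when thresholding after each frame
    fp = np.cumsum(~labels)   # false positives at the same thresholds

    # Keep one operating point per distinct score so ties are handled together.
    last = np.r_[np.nonzero(np.diff(scores))[0], len(scores) - 1]
    tpr = np.r_[0.0, tp[last] / tp[-1]]
    fpr = np.r_[0.0, fp[last] / fp[-1]]

    # Trapezoidal rule from (0, 0) to (1, 1).
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Because AUC integrates over all thresholds, it avoids having to pick a single decision threshold per dataset, which is convenient when comparing methods whose score ranges differ.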

C. Modifying inference stage

When inspecting the visualization of the results of MNAD, we notice a considerable fluctuation in the anomaly scores. The model can predict the anomalous frames well due to the powerful representation capacity of CNNs. Therefore, the scores drop to the bottom only at the beginning of an anomalous event. After that, they start to increase again as the model becomes able to predict the abnormal frames; see Figure 10.

Fig. 10. Visualization of the anomaly scores of MNAD and MNAD with modified inference on the video Right_Robbery_971: (a) MNAD, AUC = 69.27; (b) MNAD w/ modified inference, AUC = 73.66. In this figure, the ground-truth score of a normal frame is 1 and that of an abnormal frame is 0.

To mitigate this issue, we alter the inference stage to stabilize the scores. Instead of using four real previous frames as input, we mix the nearest real frame with the frame that was predicted for it in the earlier step; see Figure 11. If there are consecutive anomalous frames, this alteration increases the prediction error in the later frames and stabilizes the anomaly scores while frames are anomalous. On the other hand, when the generator returns a high-quality frame, the inference process is similar to the base one. Moreover, this alteration should not affect the generator's ability to predict normal frames because the predicted frame is mixed in at a low ratio (1:3). The mixing strategy is formulated as follows:

$$I'_i = \lambda_1 \hat{I}_i + \lambda_2 I_i$$

where $I'_i$ denotes the mixed frame, $\hat{I}_i$ is the predicted frame, and $I_i$ is the real frame. $\lambda_1$ and $\lambda_2$ are chosen using a grid search: $\lambda_1 = 0.25$, $\lambda_2 = 0.75$. If the proportion of predicted-frame information is higher than that of the actual frame, the anomaly score will be more stable. However, this limits the discriminative ability of the model. Thus, the 1:3 ratio is determined by grid search to balance the contributions of the predicted and actual frames. This alteration slightly increases the AUC score of MNAD, by 0.5% to 1.14%.

Fig. 11. Demonstration of the modified inference process.
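A minimal sketch of the mixing strategy may make the modified inference loop easier to follow. It implements the weighted sum of the predicted and real frame with the reported weights 0.25 and 0.75; the `predict` callable is a hypothetical stand-in for the trained generator, and the way the mixed frame is fed back into the four-frame input window is one reading of the description, not the authors' released code.

```python
import numpy as np
from collections import deque

def mix_frames(predicted, real, lam1=0.25, lam2=0.75):
    """Mixed frame: lam1 * predicted frame + lam2 * real frame."""
    return lam1 * predicted + lam2 * real

def run_inference(frames, predict, history=4, lam1=0.25, lam2=0.75):
    """Return per-frame mean squared prediction errors.

    frames:  array of shape (T, H, W) or (T, H, W, C)
    predict: hypothetical callable mapping a stack of `history` frames
             to a prediction of the next frame
    """
    window = deque(frames[:history], maxlen=history)
    errors = []
    for t in range(history, len(frames)):
        pred = predict(np.stack(list(window)))
        errors.append(float(np.mean((pred - frames[t]) ** 2)))
        # Feed the mixed frame, not the raw real frame, into the next input.
        window.append(mix_frames(pred, frames[t], lam1, lam2))
    return errors

# Toy check with a trivial "generator" that repeats the last input frame.
frames = np.zeros((6, 2, 2))
demo_errors = run_inference(frames, lambda w: w[-1])
print(demo_errors)  # [0.0, 0.0]
```

With the weight on the predicted frame kept small, the input stays close to the real frame for well-predicted (normal) content, while a poor prediction during an anomaly carries part of its error into the following inputs.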

V. EXPERIMENT

A. Experiment setup

We follow the setting of the Future frame prediction method [1] to split the existing datasets for the experiment. In addition, the training and testing sets of the VNAnomaly dataset are split similarly to those of other unsupervised datasets: the training set contains normal videos, whereas the testing set contains abnormal videos. The whole process is run on a GeForce RTX 2080 Ti GPU with 11019 MiB of memory. For the VNAnomaly dataset, we train the Future frame prediction, MLEP, and MNAD models for five epochs with batch sizes of 8, 8, and 4, respectively. Other hyper-parameters are set to their defaults. For the benchmark datasets, we follow the settings proposed in the original papers.
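For reproducibility, the VNAnomaly training setup stated above can be written down as a small configuration. This is only an illustrative sketch: the `TrainConfig` class and the `VNANOMALY_RUNS` name are assumptions, and only the epoch counts and batch sizes come from the text; all other hyper-parameters are left at each method's defaults and are therefore not listed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Training settings reported for one method on VNAnomaly."""
    method: str
    epochs: int
    batch_size: int

# Values taken from the experiment setup; names are hypothetical.
VNANOMALY_RUNS = [
    TrainConfig("Future frame prediction", epochs=5, batch_size=8),
    TrainConfig("MLEP", epochs=5, batch_size=8),
    TrainConfig("MNAD", epochs=5, batch_size=4),
]

for run in VNANOMALY_RUNS:
    print(f"{run.method}: {run.epochs} epochs, batch size {run.batch_size}")
```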

B. Experimental results and discussion

Following the aforementioned experiment setting, we extensively experiment with three state-of-the-art methods: Future frame prediction [1], MLEP [5], and MNAD [6]. Table II summarizes the frame-level AUC of these methods on VNAnomaly and five other publicly available datasets, namely UCSD Ped 1 [8], UCSD Ped 2 [8], Subway Entrance [9], CUHK Avenue [10], and ShanghaiTech [1].

TABLE II
AUC OF DIFFERENT METHODS ON THE PED 1, PED 2, SUBWAY ENTRANCE, AVENUE, SHANGHAITECH AND VNANOMALY DATASETS

Columns: Ped 1 [8] | Ped 2 [8] | Subway Entrance [9] | CUHK Avenue [10] | ShanghaiTech [1] | VNAnomaly (ours)

*The result achieved in the original paper is evaluated in different settings.

Overall, applying our modified inference to the MNAD [6] model yields results that are competitive with the current state-of-the-art methods on the benchmark datasets and outperforms the competitors on VNAnomaly. On CUHK Avenue [10], our method is the highest at 89.30%, whereas the MLEP method [5] is the lowest at 82.29%. Additionally, the score of the proposed method is slightly higher than those of the other methods, by margins ranging from nearly 0.5% to approximately 3%. On the VNAnomaly dataset, the proposed method has the highest score at 61.04% and always outperforms the competitors, by margins ranging from over 1% to 3%.

On the Ped 1 dataset [8], the proposed method reaches 82.15%, which is higher than the MLEP method [5] at 75.37% and the MNAD method [6] at 80.37%, but about 0.18% lower than Future frame prediction [1]. On the Ped 2 dataset [8], the proposed method performs on par with the other methods at nearly 97.00%. On the Subway Entrance dataset [9], the proposed method is slightly higher than the Future frame prediction method [1] and the MNAD method [6], at 72.14%, 71.72%, and 69.37%, respectively. On the ShanghaiTech dataset [1], the score of the proposed method is similar to those of the MLEP method [5] and the MNAD method [6] at about 70.50%, but about 2% lower than that of the Future frame prediction method [1]. It is important to point out that unsupervised anomaly detection in videos remains challenging and depends on the specific context.

It is noteworthy that the runtime difference between the base and modified methods on VNAnomaly is negligible: 1743.144 seconds for the base method and 1744.257 seconds for the modified inference method, a difference of about 1.1 seconds.

VI. CONCLUSION

In this paper, we introduce a new dataset named VNAnomaly for the anomaly detection problem in videos. The dataset contains different scenes, multiple objects, and 4 common anomaly types on Vietnam's streets. It is one of the first anomaly datasets to capture scenes in Vietnam and to reflect real-world challenges through a variety of camera angles and times of day. Furthermore, we conduct extensive experiments using three unsupervised anomaly detection methods, in which we adopt a new modified inference for the MNAD method. Our experiments demonstrate improvements over state-of-the-art methods on real-world datasets. In future work, we plan to collect more anomaly types in Vietnam to increase the diversity of the dataset, continue improving anomaly detection models, and try to deploy them on edge devices.

ACKNOWLEDGEMENT

This work was supported by the Multimedia Processing Lab (MMLab) at the University of Information Technology, VNUHCM.

REFERENCES

[1] W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection - a new baseline,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536–6545.

[2] B. Ramachandra, M. J. Jones, and R. R. Vatsavai, “A survey of single-scene video anomaly detection,” CoRR, vol. abs/2004.05993, 2020. arXiv: 2004.05993. [Online]. Available: https://arxiv.org/abs/2004.05993

[3] S. Zhu, C. Chen, and W. Sultani, “Video anomaly detection for smart surveillance,” CoRR, vol. abs/2004.00222, 2020. arXiv: 2004.00222. [Online]. Available: https://arxiv.org/abs/2004.00222

[4] W. Sultani, C. Chen, and M. Shah, “Real-world anomaly detection in surveillance videos,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Jun. 2018. [Online]. Available: https://doi.org/10.1109/cvpr.2018.00678

[5] W. Liu, W. Luo, Z. Li, P. Zhao, and S. Gao, “Margin learning embedded prediction for video anomaly detection with a few anomalies,” in IJCAI, 2019.

[6] H. Park, J. Noh, and B. Ham, “Learning memory-guided normality for anomaly detection,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Jun. 2020. [Online]. Available: https://doi.org/10.1109/cvpr42600.2020.01438

[7] M.-I. Georgescu, A. Barbalau, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, “Anomaly detection in video via self-supervised and multi-task learning,” CoRR, vol. abs/2011.07491, 2020. arXiv: 2011.07491. [Online]. Available: https://arxiv.org/abs/2011.07491

[8] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, “Anomaly detection in crowded scenes,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 1975–1981.

[9] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz, “Robust real-time unusual event detection using multiple fixed-location monitors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 555–560, Mar. 2008. [Online]. Available: https://doi.org/10.1109/tpami.2007.70825

[10] C. Lu, J. Shi, and J. Jia, “Abnormal event detection at 150 fps in matlab,” in 2013 IEEE International Conference on Computer Vision, 2013, pp. 2720–2727.

[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, Oct. 2020. [Online]. Available: https://doi.org/10.1145/3422622

[12] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in Advances in Neural Information Processing Systems.

Title	VNAnomaly: A Novel Vietnam Surveillance Video Dataset for Anomaly Detection
Authors	Tu N. Vu, Toan T. Dinh, Nguyen D. Vo, Tung Minh Tran, Khang Nguyen
Institution	Vietnam National University, University of Information Technology
Field	Information and Computer Science
Type	conference paper
Year of publication	2021
City	Ho Chi Minh City
