VNAnomaly: A novel Vietnam surveillance video
dataset for anomaly detection
Tu N Vu
Vietnam National University
University of Information Technology
Ho Chi Minh City, Vietnam
18520184@gm.uit.edu.vn
Toan T Dinh
Vietnam National University
University of Information Technology
Ho Chi Minh City, Vietnam
18521504@uit.edu.vn
Nguyen D Vo
Vietnam National University
University of Information Technology
Ho Chi Minh City, Vietnam
nguyenvd@uit.edu.vn
Tung Minh Tran
Vietnam National University
University of Information Technology
Ho Chi Minh City, Vietnam
tungtm.ncs@grad.uit.edu.vn
Khang Nguyen
Vietnam National University
University of Information Technology
Ho Chi Minh City, Vietnam
khangnttm@uit.edu.vn
Abstract—Surveillance systems have long been considered an effective tool for capturing realistic abnormal actions or events in domains such as traffic management and security. With the development of smart cities, thousands of installed surveillance cameras play a vital role in detecting and preventing dangerous events. However, there is a lack of anomaly datasets for developing automatic anomaly detection systems in Vietnam. In this study, we introduce a new dataset named VNAnomaly for anomaly detection in Vietnam. Moreover, we conduct a thorough evaluation of current state-of-the-art unsupervised anomaly detection methods based on deep architectures, including MLEP, Future frame prediction, MNAD, and MNAD with modified inference, on benchmark datasets and our dataset. Experimental results indicate that the proposed method almost always outperforms the competitors and achieves the best performance in terms of Area Under the Curve (AUC) score at 61.14%.
Index Terms—Anomaly, Anomaly Detection, Deep Learning,
VNAnomaly, Autoencoder.
I INTRODUCTION
Nowadays, with advances in artificial intelligence, integrating surveillance cameras has emerged as an efficient tool for complicated urban management tasks such as road traffic monitoring or anomalous event detection. An abnormal event in a surveillance camera is defined as an event that does not conform to expected behavior [1], [2]. The anomaly detection problem takes a sequence of frames as input and returns the label of each frame (Normal, Abnormal), see Figure 1. With the development of smart cities in Vietnam, it is reasonable to build a surveillance system that can identify abnormal events such as crimes or illegal activities. However, few studies provide a decent data resource for anomaly detection in Vietnam. Therefore, this study provides a novel dataset that focuses on human-related events, aimed at urban management.
One of the biggest challenges of anomaly detection is the ambiguity of the definition of an anomaly. Identifying an abnormal event depends not only on the activities and appearance of the objects but also on the context of the surveillance video [2], [3]. Several events are normal in some contexts but abnormal in others. For example, riding a motorbike in a pedestrian zone is considered an anomaly, but on a city road it is a normal event [2]. To avoid this ambiguity, the scope of this work focuses mainly on urban street scenes in Vietnam and on some unusual events that often occur in this context.
Fig 1. The anomaly detection model takes a video (a sequence of frames) as input and returns the label of each frame in the video: normal or abnormal. Sample images are taken from the VNAnomaly dataset.
Currently, there are two main approaches to anomaly detection problems, unsupervised learning and weakly-supervised learning, which are distinguished by the experimental setting of the training data [3]. One of the most important challenges of anomaly detection is the lack of anomalous events, which leads to an imbalanced dataset. Abnormal samples are usually expensive to collect, and unknown, new kinds of anomaly always exist. In the unsupervised learning approach, models are trained with only normal video frames and validated with both normal and anomalous frames. On the other hand, models in weakly-supervised learning approaches are trained with both normal data and only a very small amount of anomalous data. There are many benchmark datasets provided for both approaches. However, most of the unsupervised datasets are only single-scene, so they are not close to real-world scenes. We summarize our contributions as follows:
• We introduce a novel dataset for the task of unsupervised anomaly detection on streets in Vietnam.
• We conduct a thorough evaluation of current state-of-the-art methods for unsupervised anomaly detection on the dataset.
• We suggest a way to modify the inference stage of unsupervised approaches, which increases the MNAD method's result by about 0.5%-2%, compared to the previous state-of-the-art methods.
The rest of the paper is organized as follows. Section II summarizes the related works. Section III describes the collection and annotation process and gives detailed information about our dataset. Section IV discusses the evaluation method and the proposed method for our problem. Section V presents the evaluation and the outcomes obtained from the different detection methods. The paper ends with a conclusion and some directions for future work.
II RELATED WORKS
A Anomaly detection
Anomaly detection is a binary classification between the normal and the anomalous classes, and it is one of the most challenging and long-standing problems in computer vision [4]. For video surveillance applications, there have been many attempts to detect abnormalities as well as violence in videos. Overall, there are two main approaches to this problem: (1) unsupervised learning; and (2) weakly-supervised learning.
1) Unsupervised learning: In contrast to the abundance of normal events, the probability that an abnormal event appears is very low. Furthermore, it is almost infeasible to gather all kinds of abnormal events. Therefore, in unsupervised learning methods [1], [5], [6], models are trained with only normal video frames, which are readily available in benchmark datasets. The collected anomaly frames are used only for validation purposes. These methods focus on learning the pattern of normal frames and use the reconstruction or prediction loss to determine whether a frame is anomalous at inference, see Figure 2. At inference time, they try to reconstruct or predict the current frame and use the reconstruction or prediction error to calculate the anomaly score.
Fig 2. An example of the inference process of an unsupervised method, Future frame prediction [1]: a reconstructing module followed by a comparing module that outputs the anomaly score.
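To make the error-to-score step concrete, the following is a minimal sketch rather than the implementation of any of the cited methods. It assumes the prediction error is measured with PSNR and min-max normalised per video, which is one common choice for prediction-based methods; the function names `psnr` and `regularity_scores` and the toy frames are hypothetical.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between a predicted and a real frame."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def regularity_scores(psnr_values):
    """Min-max normalise the PSNR values of one video to [0, 1]; high = normal."""
    p = np.asarray(psnr_values, dtype=np.float64)
    span = p.max() - p.min()
    if span == 0:
        return np.ones_like(p)
    return (p - p.min()) / span

# Toy video (assumption): three frames, the last one is predicted poorly.
rng = np.random.default_rng(0)
real = rng.random((3, 8, 8))
pred = real.copy()
pred[0] += 0.01
pred[1] += 0.02
pred[2] += 0.30

scores = regularity_scores([psnr(p, r) for p, r in zip(pred, real)])
anomaly = 1.0 - scores  # high value = likely abnormal frame
```

In this toy example the badly predicted third frame receives the highest anomaly score, which is the behaviour the paragraph above relies on.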
2) Weakly supervised learning: For weakly-supervised learning approaches, the anomalous datasets are mainly collected from social media platforms such as YouTube and Facebook [3]. The diversity and enormous video capacity of these platforms allow researchers to access and collect a large number of anomaly videos. In these approaches [4], [7], abnormal events are explicitly predefined and collected in various contexts from numerous sources. Moreover, models are trained with both normal data and only a very small amount of anomalous data to learn to distinguish between normal and abnormal events. These approaches usually comprise three main modules, described in Figure 3: i) arranging training instances to preprocess the video-level ground truth; ii) feature extraction to extract the video's features; and iii) a fully connected network to classify whether a frame is anomalous.
Fig 3. An example of a weakly supervised method from Sultani et al. [4]: arrange training instances, feature extraction, and a fully connected module that outputs the anomaly score.
B Existing anomaly datasets
Many studies have proposed datasets for anomaly detection in recent years. These datasets can be separated into two types: single-scene and multi-scene datasets [2]. We discuss each of these below and summarize them in Table I.
1) Single-scene datasets: Single-scene datasets usually contain only a limited number of scenes, usually fewer than three. Because surveillance camera videos were difficult to collect in the past, it was reasonable to take a long video captured by one camera. Therefore, many single-scene datasets have been introduced in recent years. However, they might not be general enough to satisfy real-world surveillance applications. Samples from some popular single-scene datasets are displayed in Figure 4.
• Subway [9] covers the entrance and the exit gates in a subway station and comprises one video each. The entrance gate video sequence is 1 hour 36 minutes long, whereas the exit gate video footage is 43 minutes long, with a resolution of 384 x 512. Anomalous activities mainly include people jumping or trying to get through the turnstiles without payment, and walking in the wrong direction.
• UCSD Pedestrian [8] has two subsets, UCSD Pedestrian 1 (Ped 1) and UCSD Pedestrian 2 (Ped 2). Ped 1 contains 34 training videos and 36 evaluating videos with 40 anomalous events. Most of the outliers in this dataset are related to cyclists and car drivers entering the pedestrian zone. Ped 2 consists of 16 training videos and 12 evaluating videos with 12 anomalous events. The definition of an anomaly in Ped 2 is similar to the one in Ped 1. The main differences between these two subsets are the viewpoints, the dataset's size, and the resolution (158 x 238 in Ped 1 and 240 x 360 in Ped 2). Both subsets contain only 1 scene each.
• CUHK Avenue [10] consists of training videos and 21 evaluating videos (resolution 480 x 856 pixels) with 47 anomalous events in total, including throwing an object, running, and jumping. This dataset captures only one scene, but the size of people may change because of the distance and angle of the camera.
TABLE I
CHARACTERISTICS OF VIDEO ANOMALY DETECTION DATASETS FOR UNSUPERVISED APPROACHES
Fig 4. Samples of normal and abnormal frames from the single-scene datasets: UCSD Ped 1 (ride a bike), UCSD Ped 2 (ride a bike), Subway Entrance (enter without payment), and CUHK Avenue (throwing object). Red boxes denote anomalies in abnormal frames.
2) Multi-scene datasets: In recent years, the popularity of surveillance cameras and the rise of video-sharing platforms have enabled an increase in the number of scenes in anomaly datasets. Datasets used in recent studies include the ShanghaiTech dataset and the UCF-Crime dataset. Some samples are shown in Figure 5.
Fig 5. Samples of normal and abnormal frames from the multi-scene datasets, with example anomalies such as chasing and shoplifting. Red boxes denote anomalies in abnormal frames.
• ShanghaiTech [1] consists of training videos and 107 evaluating videos (resolution 480 x 856 pixels) captured on a campus. Presenting mostly person-based anomalies, it contains 130 abnormal events captured in 13 different scenes with complex lighting conditions and camera angles. Some of its anomaly events, such as riding a bike or skateboarding, do not relate to security purposes in general.
• UCF-Crime [4] is a dataset that spans over 128 hours of videos at 240 × 320 resolution and contains 13 different classes of real-world anomalies. Its training split contains 800 normal and 810 anomalous videos, while the test split has 150 normal and 140 anomalous videos. This dataset is intended for a very different formulation of video anomaly detection, referred to as weakly supervised anomaly detection.
III VNANOMALY DATASET
A Dataset description
To tackle the limitations of existing datasets, we present the VNAnomaly dataset. It consists of 90 training videos and 127 evaluating videos that include real-world anomalies on Vietnamese streets. The dataset covers 4 common human-related anomaly types on the streets of Vietnam: fighting, assault, vandalism, and robbery. In Figure 6, we show normal and abnormal frames from nearly similar scenes. Because VNAnomaly is an unsupervised dataset, we do not explicitly define these anomaly types. We choose the above anomaly types because they are more common than other ones. In addition, these anomaly types are relevant to the safety of public lives and assets in Vietnam's urban environments. However, some unusual events, such as traffic accidents, are not covered. Therefore, we will continue to provide more anomaly types soon.
Fig 6. The normal and abnormal frames from the VNAnomaly dataset. Panel labels: Ride a bike, Fighting.
Our dataset surpasses existing unsupervised datasets in the following three aspects.
1) To the best of our knowledge, this is one of the first unsupervised datasets that capture Vietnamese scenes.
2) Larger data volume: VNAnomaly has 578,609 training frames and 75,214 evaluating frames, which is larger than existing unsupervised benchmark datasets.
3) Higher scene diversity: Our dataset contains 36 scenes that differ in aspects such as camera angle and time of day. Furthermore, to ensure that the context the model learns in the training stage matches the testing phase, the scenes in the training set are required to match those in the testing set. Compared with the ShanghaiTech dataset [1], which has 13 scenes, our dataset has nearly three times as many scenes. Some different scenes are shown in Figure 7.
Fig 7. The diversity of scenes in the VNAnomaly dataset (example anomalies shown: fighting and robbery).
B Collecting process
To enhance the dataset's quality, we constrain the context of this dataset to streets in Vietnam captured by surveillance cameras. We search for videos on YouTube using text search queries in the Vietnamese language. In addition, to ensure that the videos collected from YouTube are surveillance videos, we closely inspect the context of each video and choose only the videos that are not related to commercial purposes and are captured by surveillance cameras. After that, following the collecting process in [4], we discard videos that fall into any of the following categories: manually edited, prank videos, not captured by CCTV cameras, taken from news, captured using a hand-held camera, or containing compilations. Additionally, we try to collect training videos that have some similarities with the videos in the testing set. The training sequences last seven and a half hours, whereas the testing sequences last approximately 1 hour with 110 anomalous events. In Figure 6, we show normal and abnormal frames from nearly similar scenes.
More specifically, our dataset contains street scenes in three different surveillance video types related to time: daytime, full-color nighttime, and black-and-white nighttime videos. In total, our dataset has 5 hours 19 minutes 59 seconds of daytime videos and 3 hours 49 seconds of nighttime videos. The time length of each surveillance video type in the training and testing sets is shown in Figure 8. Moreover, the scenes in the videos are street scenes captured from 2 different camera angles: left angle (3 hours 39 minutes and 53 seconds) and right angle (4 hours 19 minutes and 52 seconds). The diversity of scenes, view angles, and light conditions reflects the real-world environment, which also poses a challenge for anomaly detection.
Fig 8. The time length of the different surveillance video types: (a) training set, (b) testing set.
C Annotation
Ground truths for each testing video are annotated in temporal form. We divide the VNAnomaly dataset into separate parts for the annotators. The annotators then cross-check each other's labels, and we re-check the labels ourselves to catch mistakes. Our annotation form follows the annotation form in [1], [4], i.e., the start and end frames of each anomalous event in each testing video.
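For frame-level evaluation, such temporal annotations are typically expanded into one label per frame. The sketch below illustrates this under stated assumptions: frame indices are 0-based and the end frame is inclusive, neither of which is specified in the text, and the function name `intervals_to_labels` is hypothetical.

```python
import numpy as np

def intervals_to_labels(num_frames, intervals):
    """Expand (start, end) frame intervals into per-frame labels.

    Assumes 0-based indices and an inclusive end frame.
    Returns an array with 1 for abnormal frames and 0 for normal frames.
    """
    labels = np.zeros(num_frames, dtype=np.int8)
    for start, end in intervals:
        if not (0 <= start <= end < num_frames):
            raise ValueError(f"interval {(start, end)} is outside 0..{num_frames - 1}")
        labels[start:end + 1] = 1
    return labels

# Toy example (assumption): a 10-frame video with two anomalous events.
labels = intervals_to_labels(10, [(2, 4), (7, 8)])
```

Validating the interval bounds up front catches annotation mistakes such as swapped start and end frames before they silently affect the evaluation.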
IV EVALUATION
A Methods
In this study, we evaluate three state-of-the-art unsupervised anomaly detection methods: Future frame prediction [1], Margin learning embedded prediction (MLEP) [5], and Learning Memory-guided Normality for Anomaly Detection (MNAD) [6].
• Future frame prediction [1] is one of the baselines for anomaly detection problems. It uses a generative adversarial network (GAN) [11] to exploit normal patterns and predict the next frame. This method focuses on predicting the future frame and uses the prediction error to calculate the anomaly score. An illustration of this process is given in Figure 2.
• MLEP [5] is based on ConvLSTM [12]. In this method, the feature corresponding to the hidden state of the last input in the ConvLSTM is fed to the margin learning module. This module learns a more compact normal data distribution and enlarges the margin between normal and abnormal events, which improves the model's ability to discriminate between normal and abnormal frames.
• MNAD [6] is one of the unsupervised methods. This model adds a new memory module that records prototypical patterns of normal data in its memory items. Moreover, feature compactness and separateness losses are proposed to train the memory module.
B Measurement
Evaluation scores in most works on anomaly detection [1], [5], [6] are calculated based on the receiver operating characteristic (ROC) curve. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The final score is the area under the ROC curve (AUC score). AUC measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1). The higher the AUC score, the better the model distinguishes between normal and anomalous frames. AUC scores of the experiment results are visualized in Figure 9.
Fig 9 AUC score visualization of state-of-the-art methods on VNAnomaly
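The frame-level AUC described above can be sketched as follows. This is an illustrative NumPy implementation, not the evaluation script used in the experiments: it sweeps the threshold over the sorted anomaly scores, collects the (FPR, TPR) points, and integrates with the trapezoidal rule. The function name `roc_auc` and the toy inputs are assumptions.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve; labels: 1 = abnormal, scores: higher = more abnormal."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both normal and abnormal frames are required")
    # Sort frames by descending score, i.e. sweep the threshold downwards.
    order = np.argsort(-scores, kind="mergesort")
    scores, labels = scores[order], labels[order]
    tps = np.cumsum(labels)    # true positives at each threshold
    fps = np.cumsum(~labels)   # false positives at each threshold
    # Keep one ROC point per distinct score so that ties are handled correctly.
    distinct = np.r_[np.diff(scores) != 0, True]
    tpr = np.r_[0.0, tps[distinct] / n_pos]
    fpr = np.r_[0.0, fps[distinct] / n_neg]
    # Trapezoidal integration of TPR over FPR from (0,0) to (1,1).
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example (assumption): four frames, two of them abnormal.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A score of 0.5 corresponds to a detector that cannot separate the two classes, and 1.0 to a perfect separation.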
C Modifying inference stage
When inspecting the visualization of the results of MNAD, we notice considerable fluctuation in the anomaly scores. Because of the powerful representation capacity of CNNs, the model can also predict anomalous frames. Therefore, the scores drop to the bottom only at the beginning of an anomalous event. After that, they start to increase as the model manages to predict the abnormal frames, see Figure 10.
Fig 10. Anomaly score visualization of (a) MNAD (AUC = 69.27) and (b) MNAD w/ modified inference (AUC = 73.66) on video Right_Robbery_971. In this figure, the ground-truth score for a normal frame is 1 and the ground-truth score for an abnormal frame is 0.
To mitigate the above issue, we alter the inference stage to stabilize the scores. Instead of using four real previous frames as input, we mix the nearest real frame with the frame that was predicted in the earlier step, see Figure 11. If there are consecutive anomaly frames, this alteration increases the prediction error in the later frames and stabilizes the anomaly scores while the frames are anomalous. On the other hand, when the generator returns a high-quality frame, the inference process is similar to the base one. Moreover, this alteration should not affect the generator's ability to predict normal frames because the predicted frame is mixed in at a low ratio (1:3). The mixing strategy is formulated below:
$$I'_i = \lambda_1 \hat{I}_i + \lambda_2 I_i$$
where $I'_i$ denotes the mixed frame, $\hat{I}_i$ is the predicted frame, and $I_i$ is the real frame. $\lambda_1$ and $\lambda_2$ are chosen using a grid search: $\lambda_1 = 0.25$, $\lambda_2 = 0.75$. If the proportion of predicted-frame information is higher than that of actual-frame information, the anomaly score will be more stable. However, this limits the discriminative ability of the model. Thus, the 1:3 ratio is determined by grid search to balance the contributions of the predicted and actual frames. This alteration slightly increases the effectiveness of MNAD, by 0.5% to 1.14% in AUC score.
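The mixing strategy can be illustrated with the following sketch. It is not the authors' implementation: the generator is replaced by a hypothetical stand-in (`last_frame_predictor`), the error is a plain mean squared error, and the assumption is that each mixed frame replaces the corresponding real frame in the four-frame input window of later steps. Only the weights 0.25 and 0.75 come from the text.

```python
import numpy as np

LAMBDA_PRED, LAMBDA_REAL = 0.25, 0.75  # 1:3 ratio reported in the text

def mix_frames(pred, real, lam_pred=LAMBDA_PRED, lam_real=LAMBDA_REAL):
    """Mixed frame I'_i = lambda_1 * predicted frame + lambda_2 * real frame."""
    return lam_pred * pred + lam_real * real

def modified_inference(frames, predict_next, window=4):
    """Per-frame prediction errors when mixed frames are fed back as input."""
    history = [f.astype(np.float64) for f in frames[:window]]
    errors = []
    for real in frames[window:]:
        real = real.astype(np.float64)
        pred = predict_next(np.stack(history[-window:]))
        errors.append(float(np.mean((pred - real) ** 2)))
        # Assumption: the mixed frame, not the real one, enters the next window.
        history.append(mix_frames(pred, real))
    return errors

def last_frame_predictor(stack):
    """Hypothetical stand-in for the generator: repeats the latest input frame."""
    return stack[-1]

# Toy video (assumption): frame t is a constant image with intensity t.
frames = np.stack([np.full((4, 4), float(t)) for t in range(7)])
errors = modified_inference(frames, last_frame_predictor)
```

With this stand-in predictor, a poorly predicted frame contaminates the next input window, so the error grows over consecutive hard-to-predict frames instead of falling back, which mirrors the stabilizing effect described above.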
Fig 11. Demonstration of the modified inference process.
V EXPERIMENT
A Experiment setup
We follow the setting of the Future frame prediction method [1] to split the existing datasets for the experiment. In addition, the training and testing sets of the VNAnomaly dataset are split similarly to the settings of other unsupervised datasets: the training set contains normal videos, whereas the testing set contains abnormal videos. The whole process is implemented on a GeForce RTX 2080 Ti GPU with 11019 MiB of memory. For the VNAnomaly dataset, we train the Future frame prediction, MLEP, and MNAD models for five epochs with batch sizes of 8, 8, and 4, respectively. Other hyper-parameters are set to their defaults. In addition, we follow the settings proposed in the original papers for the benchmark datasets.
B Experimental results and discussion
Following the aforementioned experiment setting, we extensively experiment with three state-of-the-art methods: Future frame prediction [1], MLEP [5], and MNAD [6]. Table II summarizes the frame-level AUC of these methods on VNAnomaly and five other publicly available datasets, namely UCSD Ped 1 [8], UCSD Ped 2 [8], Subway Entrance [9], CUHK Avenue [10], and ShanghaiTech [1]. Overall, our modified inference positively affects the MNAD [6] model: it gives competitive results compared to the current state-of-the-art methods on the benchmark datasets and outperforms the competitors on VNAnomaly. On CUHK Avenue [10], our method is the highest at 89.30%, whereas the MLEP method [5] is the lowest at 82.29%. Additionally, the result of the proposed method is slightly higher than those of the other methods, by nearly 0.5% to approximately 3%. On the VNAnomaly dataset, the proposed method has the highest result at 61.04% and always outperforms the competitors, by over 1% to 3%. On the Ped 1 dataset [8], the proposed method reaches 82.15%, which is higher than the MLEP method [5] at 75.37% and the MNAD method [6] at 80.37%, but slightly lower than Future frame prediction [1], by about 0.18%. On the Ped 2 dataset [8], the result of the proposed method is about the same as those of the other methods, at nearly 97.00%. Similarly, on the Subway Entrance dataset [9], the result of the proposed method is slightly higher than those of the Future frame prediction method [1] and the MNAD method [6], at 72.14%, 71.72%, and 69.37%, respectively. Likewise, on the ShanghaiTech dataset [1], the result of the proposed method is similar to those of the MLEP method [5] and the MNAD method [6], at about 70.50%, but it is slightly lower than that of the Future frame prediction method [1], by about 2%. It is important to point out that unsupervised anomaly detection in videos remains challenging and depends on the specific context.
TABLE II
AUC OF DIFFERENT METHODS ON THE PED 1, PED 2, SUBWAY ENTRANCE, AVENUE, SHANGHAITECH AND VNANOMALY DATASETS
Ped 1 [8] | Ped 2 [8] | Subway Entrance [9] | CUHK Avenue [10] | ShanghaiTech [1] | VNAnomaly (ours)
*The result achieved in the original paper is evaluated in different settings.
It is noteworthy that the runtime difference between the base and modified methods on VNAnomaly is negligible: 1743.144 seconds for the base method and 1744.257 seconds for the modified inference method.
VI CONCLUSION
In this paper, we introduce a new dataset named VNAnomaly for the anomaly detection problem in videos. The dataset contains different scenes, multiple objects, and 4 common anomaly types on Vietnam's streets. It is one of the first anomaly datasets to capture scenes in Vietnam and to reflect real-world challenges with its variety of angles and times of day. Furthermore, we conduct extensive experiments using three unsupervised anomaly detection methods, wherein we adopt a new modified inference for the MNAD method. Our experiments demonstrate improvements over state-of-the-art methods on real-world datasets. In our future work, we plan to collect more anomaly types in Vietnam to increase the diversity of the dataset, continue improving anomaly detection models, and try to deploy them on edge devices.
ACKNOWLEDGEMENT
This work was supported by the Multimedia Processing Lab (MMLab) at the University of Information Technology, VNUHCM.
REFERENCES
[1] W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection - a new baseline,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536–6545.
[2] B. Ramachandra, M. J. Jones, and R. R. Vatsavai, “A survey of single-scene video anomaly detection,” CoRR, vol. abs/2004.05993, 2020. arXiv: 2004.05993. [Online]. Available: https://arxiv.org/abs/2004.05993
[3] S. Zhu, C. Chen, and W. Sultani, “Video anomaly detection for smart surveillance,” CoRR, vol. abs/2004.00222, 2020. arXiv: 2004.00222. [Online]. Available: https://arxiv.org/abs/2004.00222
[4] W. Sultani, C. Chen, and M. Shah, “Real-world anomaly detection in surveillance videos,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Jun. 2018. [Online]. Available: https://doi.org/10.1109/cvpr.2018.00678
[5] W. Liu, W. Luo, Z. Li, P. Zhao, and S. Gao, “Margin learning embedded prediction for video anomaly detection with a few anomalies,” in IJCAI, 2019.
[6] H. Park, J. Noh, and B. Ham, “Learning memory-guided normality for anomaly detection,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Jun. 2020. [Online]. Available: https://doi.org/10.1109/cvpr42600.2020.01438
[7] M.-I. Georgescu, A. Barbalau, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, “Anomaly detection in video via self-supervised and multi-task learning,” CoRR, vol. abs/2011.07491, 2020. arXiv: 2011.07491. [Online]. Available: https://arxiv.org/abs/2011.07491
[8] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, “Anomaly detection in crowded scenes,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 1975–1981.
[9] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz, “Robust real-time unusual event detection using multiple fixed-location monitors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 555–560, Mar. 2008. [Online]. Available: https://doi.org/10.1109/tpami.2007.70825
[10] C. Lu, J. Shi, and J. Jia, “Abnormal event detection at 150 fps in matlab,” in 2013 IEEE International Conference on Computer Vision, 2013, pp. 2720–2727.
[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, Oct. 2020. [Online]. Available: https://doi.org/10.1145/3422622
[12] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in Advances in neural information processing