UIT-Anomaly: A Modern Vietnamese Video Dataset for Anomaly Detection

Dung T.T. Vo
University of Information Technology, Vietnam National University
Ho Chi Minh City, Vietnam
18520641@gm.uit.edu.vn

Tung Minh Tran
University of Information Technology, Vietnam National University
Ho Chi Minh City, Vietnam
tungtm.ncs@grad.uit.edu.vn

Nguyen D. Vo
University of Information Technology, Vietnam National University
Ho Chi Minh City, Vietnam
nguyenvd@uit.edu.vn

Khang Nguyen
University of Information Technology, Vietnam National University
Ho Chi Minh City, Vietnam
khangnttm@uit.edu.vn
Abstract—Anomaly detection in videos is of utmost importance for numerous tasks in the field of computer vision. We introduce the UIT-Anomaly dataset, captured in Vietnam, with a total duration of 200 minutes. It contains 224 videos with six different types of anomalies. Moreover, we apply a method for weakly supervised video anomaly detection, called Robust Temporal Feature Magnitude learning (RTFM), which is based on feature magnitude learning to detect abnormal snippets. The applied method yields competitive results compared with other state-of-the-art algorithms on publicly available datasets such as ShanghaiTech and UCF-Crime.
Index Terms—Anomaly Detection, Weak Supervision, Multiple Instance Learning.
I. INTRODUCTION

Nowadays, remote anomaly detection has become more popular due to the increase in the number of surveillance cameras. However, these surveillance systems are still not timely and require manual labour. Therefore, it is necessary to leverage the power of computer vision to automatically detect anomalies in videos. The goal of this problem is to find a model that accurately identifies the start and end points of an anomalous event. The input and output of the problem are demonstrated in Figure 1.
Fig. 1. Demonstration of the input and output of the problem. Input: videos filmed in Vietnam. Output: the time window (from the starting frame to the ending frame) that contains the abnormal events.
In this paper, we apply a method for weakly supervised anomaly detection called RTFM, which learns from training videos annotated at the video level. Each video is represented as a bag of video snippets, and this anomaly detection approach is based on the temporal feature magnitude of the snippets in the video. Specifically, normal snippets are represented with low feature magnitude, whereas abnormal snippets are denoted by high feature magnitude. In this approach, the k snippets with the highest feature magnitude are selected from both normal and abnormal videos, which makes the probability of selecting truly abnormal snippets in anomalous videos higher than in the MIL method [9]; these k snippets play a vital role in training a snippet classifier.

One of the biggest challenges for anomaly detection in Vietnam is the lack of data. Benchmark datasets for this problem are often taken from movies and are rarely extracted from surveillance cameras. Hence, they cannot provide high realism or distinctive features such as the settings, individuals, or forms of violence found in Vietnam. Therefore, we built a novel dataset of normal and abnormal videos in Vietnam, called UIT-Anomaly. Our dataset consists of 224 videos with six types of unusual behavior common in Vietnam. All the videos in our dataset capture actual events in a variety of contexts, which makes UIT-Anomaly more diverse than other benchmark datasets, as presented in Section IV.
II. RELATED WORK
In this section, we present two aspects: anomaly detection and the multiple instance learning (MIL) method [9].
A. Anomaly Detection
There are three main approaches to solving the anomaly detection problem in videos: unsupervised anomaly detection, supervised anomaly detection, and weakly supervised anomaly detection.
1) Unsupervised anomaly detection: as mentioned earlier, one of the challenges of anomaly detection is the lack of data. Anomalies do not often occur in real life, which hinders collecting data in a variety of contexts. By contrast, normal samples are easy to collect and do not take much time. Therefore, only normal videos in the training set are used in this approach. This helps to save time and effort when building a dataset.
2) Supervised anomaly detection: with this approach, frame-level annotation is required for both the training and the test sets, which means distinguishing between abnormal and normal frames. This is the most expensive step in the process of building a dataset.
3) Weakly supervised anomaly detection: in this approach, the training set is annotated at the video level, but the model is still required to learn frame-level prediction because the testing set remains fully annotated as in the supervised approach. Therefore, compared with the supervised approach, the annotation cost in the weakly supervised approach is very low, and it achieves much better performance than the unsupervised approach. The weakly supervised approach is thus the best option for the anomaly detection problem.
B. MIL Method
Sultani et al. [9] proposed a method for anomaly detection based on multiple instance learning. In this method, before extracting features, the frame rate of each video was changed to 30 fps with a size of 240 × 320 pixels. Each video was represented as a bag of 32 different snippets. Each snippet was divided into sets of 16 frames to extract features at the FC6 layer of the C3D network, so the feature vector of each set of 16 frames was a 4096-D one. The method averaged the l2-normalized features of all the 16-frame sets in a snippet and used the result as the feature of the whole snippet, so the feature vector of each snippet was also 4096-D. In the snippet classification step, the features of all snippets of a video were fed into an FC neural network that takes the 4096-D vector as input and has 3 layers containing 512, 32, and 1 units, respectively. There was 60% dropout [8] between the layers. Moreover, the method used the ReLU [6] activation for the first FC layer and the Sigmoid activation for the last FC layer. In addition, the anomaly score of each snippet was considered as the anomaly score of the frames in it. Figure 2 illustrates the MIL method.
Fig. 2. Deep MIL Ranking method [9].
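To make the snippet-scoring pipeline above concrete, the following is a minimal PyTorch sketch of the MIL snippet classifier just described. The layer sizes, activations, and 60% dropout follow the text; the class name, the toy bag tensor, and the max-pooling usage line are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MILSnippetClassifier(nn.Module):
    """Maps a 4096-D C3D snippet feature to an anomaly score in [0, 1]."""
    def __init__(self, feat_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 512),
            nn.ReLU(),            # ReLU activation for the first FC layer
            nn.Dropout(0.6),      # 60% dropout between layers [8]
            nn.Linear(512, 32),
            nn.Dropout(0.6),
            nn.Linear(32, 1),
            nn.Sigmoid(),         # Sigmoid activation for the last FC layer
        )

    def forward(self, snippet_feats: torch.Tensor) -> torch.Tensor:
        # snippet_feats: (batch, 32, 4096), one bag of 32 snippets per video
        return self.net(snippet_feats).squeeze(-1)  # (batch, 32) anomaly scores

# Hypothetical usage: score a bag and let the top-scoring snippet represent
# the video, as in MIL ranking.
bag = torch.randn(1, 32, 4096)            # stand-in for C3D FC6 features
scores = MILSnippetClassifier()(bag)      # per-snippet anomaly scores
video_score = scores.max(dim=1).values    # highest-scoring snippet per video
```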
Although this method works effectively, there are still many limitations. In MIL, the snippet with the highest anomaly score is used to represent each video, so it is likely that this snippet is not actually an anomaly, because the abnormal snippets are overwhelmed by the normal ones. In the case of more than one outlier, the chance to learn more abnormal snippets in each video might be missed when using this method.
III. METHODOLOGY
Yu Tian et al. [10] proposed a method for anomaly detection called RTFM to overcome the MIL method's drawbacks. Similar to MIL, RTFM detects anomalous and normal snippets by learning from weakly labeled videos to identify each snippet as normal or anomalous. Each video is represented as a bag consisting of T snippets, whose features are extracted via I3D [2] or C3D [11]. After that, F denotes the D-dimensional features of the T snippets in the video.
Fig. 3. RTFM method [10].
The multi-scale temporal network (MTN) [10] incorporates two modules, a pyramid of dilated convolutions (PDC) [13] and a temporal self-attention (TSA) module [12], on the temporal dimension, capturing the multi-resolution local temporal dependencies and the global temporal dependencies between video snippets, as presented in Figure 4. The output of MTN is the set of temporal features transformed from F, denoted X.
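The PDC module can be pictured as parallel one-dimensional convolutions over the snippet (time) axis with growing dilation rates, each branch seeing a different temporal resolution. Below is a hedged sketch of that idea; the branch width, kernel size, and dilation rates (1, 2, 4) are illustrative assumptions, not the exact MTN configuration of [10].

```python
import torch
import torch.nn as nn

class DilatedTemporalPyramid(nn.Module):
    """Multi-resolution local temporal modeling over snippet features."""
    def __init__(self, in_dim: int = 2048, branch_dim: int = 512,
                 dilations=(1, 2, 4)):
        super().__init__()
        # One 1-D conv branch per dilation rate; padding keeps the length T.
        self.branches = nn.ModuleList([
            nn.Conv1d(in_dim, branch_dim, kernel_size=3, dilation=d, padding=d)
            for d in dilations
        ])

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, T, D) snippet features; Conv1d expects (batch, D, T)
        f = f.transpose(1, 2)
        multi_scale = [branch(f) for branch in self.branches]
        # Concatenate the scales along channels and restore (batch, T, C)
        return torch.cat(multi_scale, dim=1).transpose(1, 2)

x = DilatedTemporalPyramid()(torch.randn(2, 32, 2048))  # -> (2, 32, 1536)
```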
To classify a video or a snippet as normal or anomalous, the method uses the l2 norm to calculate feature magnitudes and then selects the k snippets with the highest feature magnitude. Assuming that a normal snippet has a smaller feature magnitude than an anomalous one, RTFM optimizes the average feature magnitude of the top-k snippets from each video. In the feature magnitude learning phase, the highest feature magnitudes of normal videos' snippets are minimized, whereas the highest feature magnitudes of anomalous videos' snippets are maximized. This helps to increase the ability to separate normal and anomalous videos. Finally, RTFM uses the k snippets with the highest feature magnitude to train a snippet classifier.
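A minimal sketch of this objective follows: take the l2 norm of each temporal feature, average the k largest magnitudes per video, and push the abnormal-video average above the normal-video average. The hinge form and the margin value are illustrative assumptions; the paper's exact loss may differ in detail.

```python
import torch

def topk_mean_magnitude(x: torch.Tensor, k: int) -> torch.Tensor:
    # x: (batch, T, D) temporal features X produced by MTN
    magnitudes = x.norm(p=2, dim=-1)            # (batch, T) per-snippet l2 norm
    topk = magnitudes.topk(k, dim=-1).values    # k largest magnitudes per video
    return topk.mean(dim=-1)                    # (batch,) top-k mean magnitude

def magnitude_separation_loss(x_normal: torch.Tensor,
                              x_abnormal: torch.Tensor,
                              k: int = 3, margin: float = 100.0) -> torch.Tensor:
    # Maximize the top-k magnitude of abnormal videos while minimizing it for
    # normal videos: penalize batches where the separation is below a margin.
    sep = topk_mean_magnitude(x_abnormal, k) - topk_mean_magnitude(x_normal, k)
    return torch.relu(margin - sep).mean()

loss = magnitude_separation_loss(torch.randn(16, 32, 2048),
                                 2.0 * torch.randn(16, 32, 2048))
```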
To sum up, the RTFM training process includes the optimization of three modules: (1) multi-scale temporal feature learning; (2) feature magnitude learning; and (3) snippet classifier training. RTFM's training process is shown in Figure 3.
IV. DATASET

Nowadays, there are publicly available datasets for anomaly detection problems such as UMN [7], Violent-Flows [3], and Avenue [5]. However, they still have some disadvantages. For example, their anomalous behavior is staged or extracted from movies, which leads to a lack of realism; in particular, they cannot express the context and features of Vietnam regarding the environment, culture, people, and types of violence. Therefore, we built a novel surveillance video dataset, called UIT-Anomaly.

Fig. 4. Multi-scale temporal network (MTN) [13].
A. Selecting Anomaly Categories
As far as we know, it is difficult to completely define anomalous behavior because it has many aspects and presentations in the real world, so we clearly describe the anomalous activities to minimize ambiguity when creating the ground truth. To mitigate the above issues, we consider the six following anomaly classes: Stealing, Traffic Accident, Fighting, Unsporting Behavior, Against, and Dog Thief. We are interested in these anomalies because they have distinct features in Vietnam. Additionally, some samples of our dataset are presented in Figure 5.
B. Video Collection
We collect videos from YouTube by using keywords like “street violence”, “street stealing”, “marital combat”, “dog thief”, and other words with a similar meaning for each type of anomaly. For normal behavior, we search for “security camera at school”, “CCTV at street”, “CCTV at home”, etc. Anomalous behavior in Vietnam is not often captured by CCTV, so we also collect videos captured by smartphones and car black boxes. However, videos captured by CCTV still account for the majority, and all of them are shot from real events.
C. Video Cleaning
As a rule, we prefer videos that have not been manually edited, so our annotator team checks each video to make sure of this. Because the number of videos satisfying this rule is not sufficient, we decide to keep both the original videos and those edited videos whose content is still intact. After that, we re-edit such videos by deleting borders, changing the video speed back to normal, etc., intending to make each video as close to the original as possible. Furthermore, we also remove videos with excessive modifications or videos with unclear anomalies.
D. Video Annotation
Our dataset is annotated based on the weakly supervised approach, so the training set only needs annotation at the video level. In addition, the testing set is also annotated at the frame level to evaluate the performance of methods in the testing phase, that is, to confirm the start and end frames of anomalous activities. The dataset was finally accomplished after intense efforts over several months.
E. Dataset Statistics
The UIT-Anomaly dataset includes a total of 224 muted videos captured at a frame rate of 30 fps with various resolutions. It has 104 normal and 120 anomalous videos. The total duration is more than 200 minutes, corresponding to 392,188 frames. We divide these videos into two subsets: the training set includes 90 abnormal and 90 normal videos, while the test set consists of the remaining 30 abnormal and 14 normal videos. Both the training and test sets contain the six classes of anomalies. The distributions of videos in terms of length and of the number of frames in the test videos are presented in Figures 7 and 8, respectively.
We compare the UIT-Anomaly dataset with the others in Table I. Our dataset has a size and length that overwhelm the rest of the datasets. Overall, abnormal activities in UIT-Anomaly are very different from the anomalous activities of other datasets. For example, the Against and Dog Thief classes are two anomalies that are very rare in other datasets. Regarding diversity, the number of anomaly types in our dataset is larger. In addition, we also collect videos in indoor and outdoor places such as streets, homes, and restaurants, whereas other datasets only focus on one specific space. Moreover, other datasets cannot represent the specific features of anomalous events in Vietnam.
V. EXPERIMENTS
A. Implementation Details
In our approach, each video is divided into 32 snippets. We conduct experiments on the UIT-Anomaly dataset with k = 3, 4, 5, 6, 7, 8, 9, 10, 11. Moreover, 2048-D features are extracted at the mix_5c layer of the pre-trained I3D network. Furthermore, the three FC layers in MTN have 512, 128, and 1 units, respectively, and each FC layer uses a ReLU activation function with 70% dropout between FC layers. For all experiments, we train the RTFM method using the Adam optimizer with a weight decay of 0.0005 and a batch size of 16. In addition, we set the learning rate to 0.001 for 15,000 epochs. Each mini-batch includes samples from 32 randomly selected normal and abnormal videos.
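The optimization setup above can be summarized in a short sketch. The hyperparameters (Adam, learning rate 0.001, weight decay 0.0005, batch size 16, FC sizes 512/128/1 with 70% dropout) follow the text; the stand-in model, the random features, and the top-3 score pooling are placeholders for the real RTFM network and the 32-video mini-batch sampler.

```python
import torch
import torch.nn as nn

# Stand-in snippet scorer with the FC sizes and dropout stated in the text.
model = nn.Sequential(
    nn.Linear(2048, 512), nn.ReLU(), nn.Dropout(0.7),
    nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.7),
    nn.Linear(128, 1), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)

for step in range(1):  # the text trains for 15,000 epochs; one step shown here
    feats = torch.randn(16, 32, 2048)            # placeholder for sampled I3D features
    labels = torch.randint(0, 2, (16,)).float()  # video-level (weak) labels
    snippet_scores = model(feats).squeeze(-1)    # (16, 32) per-snippet scores
    # Train on the mean score of the top-3 snippets per video, echoing the
    # top-k selection idea; the exact RTFM loss is richer than this BCE.
    video_scores = snippet_scores.topk(3, dim=1).values.mean(dim=1)
    loss = nn.functional.binary_cross_entropy(video_scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```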
Fig. 5. Some samples of the six anomalous activities in the UIT-Anomaly dataset.
TABLE I: Comparison between UIT-Anomaly and benchmark datasets.

Dataset | # Videos | Total length | Anomaly types | Scene | Vietnamese context
Avenue [5] | 37 | 30 min | Run, throw, new object | Campus avenue | No
UCSD Ped1 [4] | 70 | 5 min | Bikers, small carts, walking across walkways | Walkway | No
UCSD Ped2 [4] | 28 | 5 min | Bikers, small carts, walking across walkways | Walkway | No
Subway Entrance [1] | 1 | 1.5 hours | Wrong direction, no payment | Subway | No
Subway Exit [1] | 1 | 1.5 hours | Wrong direction, no payment | Subway | No
UIT-Anomaly (Ours) | 224 | 3.5 hours | Stealing, Traffic Accident, Fighting, Unsporting Behavior, Against, Dog Thief | Street, house, restaurant, grocery store, office, etc. | Yes
B. Performance and Evaluation
From the results in Table II, we see that the proposed framework, which uses the k snippets with the largest feature magnitude, detects anomalies effectively. It is noticeable that the ability to distinguish between abnormal and normal videos tends to go up as k increases, because expanding the scope of training makes the model learn better. However, if k is too high, the model will fail to detect snippets as anomalies because the abnormal samples are overwhelmed by the normal ones in both normal and abnormal videos.
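For reference, frame-level AUC on this kind of benchmark is typically computed by repeating each snippet's anomaly score over the frames it covers (as described in Section II-B) and comparing the resulting frame scores against the frame-level ground truth. The sketch below uses scikit-learn's roc_auc_score on illustrative toy data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

snippet_scores = np.array([0.10, 0.20, 0.90, 0.15])  # one score per snippet
frames_per_snippet = 16
frame_scores = np.repeat(snippet_scores, frames_per_snippet)

frame_labels = np.zeros_like(frame_scores)  # frame-level ground truth
frame_labels[32:48] = 1.0                   # anomaly window: the third snippet

print(f"frame-level AUC: {roc_auc_score(frame_labels, frame_scores):.4f}")
```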
Fig. 7. Distribution of videos according to length in both the training and testing sets.

Fig. 8. Distribution of video frames in the testing videos.
TABLE II: AUC performance on the UIT-Anomaly dataset (%).

k = 10 | 73.98
k = 11 | 72.12
Furthermore, we also compare the results in two cases: training from scratch on UIT-Anomaly and using the best pre-trained parameters from the ShanghaiTech and UCF-Crime datasets of [10], as reported in Table III. Although the parameters trained on ShanghaiTech and UCF-Crime achieve high accuracy on their original datasets (AUC = 97.21% and 84.30%, respectively [10]), they have the lowest performance on the UIT-Anomaly dataset. Therefore, building a dataset for anomaly detection problems in Vietnam is essential.
TABLE III: Comparison of training results with parameters trained on benchmark datasets, in terms of the AUC metric (%).

Trained parameters from ShanghaiTech | 42.84
Trained parameters from UCF-Crime | 49.84
Trained on UIT-Anomaly (k = 9) | 76.07
As shown in Figure 9, we visualize anomaly detection on the testing videos. The videos include well-recorded anomalous events such as Traffic_Accident_051 and Dog_Thief_033. Furthermore, the model almost correctly detects the start and end times of the unusual events. The experimental results on normal videos, such as Normal_088 and Normal_074, are also good.
However, failure cases are observed in the testing videos. For instance, in the Stealing_058 video, it is clear that the model does not detect any anomalous event, because the thief snatches the bag instantaneously and the CCTV installation position is too high to capture the situation. Similarly, the model does not detect the start and end times of the anomalous event occurring during a soccer match in Unsporting_Behavior_008. This happens because the constant and disordered movement of the players makes detection difficult for the model, which can lead to missed detections.
For videos with many anomalies occurring in a very short time, such as the Fighting_100 video, RTFM cannot detect the unusual events separately; instead, it treats them as a single anomalous event. This is one of the challenges of the UIT-Anomaly dataset: even a SOTA method like RTFM, which achieves high performance on other benchmark datasets, still has some drawbacks when tested on it.
VI. CONCLUSION
We introduce the UIT-Anomaly dataset, a novel dataset for video anomaly detection in Vietnam. Moreover, frame-level ground truth labels for anomalous events in the videos are provided to evaluate anomaly detection methods across a variety of approaches. Several state-of-the-art methods are thoroughly evaluated on this dataset, and the results show that there is still room for improvement. We hope that the proposed dataset will stimulate the development of new weakly supervised or unsupervised anomaly detection methods.
ACKNOWLEDGMENT

This work was supported by the Multimedia Processing Lab (MMLab) at the University of Information Technology, VNU-HCM. We would also like to show our gratitude to the UIT-Together research group for sharing their pearls of wisdom with us during this research.
Fig. 9. Qualitative results of the RTFM method on testing videos. The pink window shows the ground-truth anomalous region. In order from left to right and top to bottom: Traffic_Accident_051, Normal_088, Normal_074, Dog_Thief_033, Against_033, Stealing_051, Stealing_058, Unsporting_Behavior_008, and Fighting_100.
REFERENCES

[1] Amit Adam, Ehud Rivlin, Ilan Shimshoni, and Daviv Reinitz. "Robust real-time unusual event detection using multiple fixed-location monitors". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 30.3 (2008), pp. 555–560.
[2] Joao Carreira and Andrew Zisserman. "Quo vadis, action recognition? A new model and the Kinetics dataset". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299–6308.
[3] Tal Hassner, Yossi Itcher, and Orit Kliper-Gross. "Violent flows: Real-time detection of violent crowd behavior". In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2012, pp. 1–6.
[4] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. "Anomaly detection and localization in crowded scenes". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 36.1 (2013), pp. 18–32.
[5] Cewu Lu, Jianping Shi, and Jiaya Jia. "Abnormal event detection at 150 fps in Matlab". In: Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 2720–2727.
[6] Vinod Nair and Geoffrey E. Hinton. "Rectified linear units improve restricted Boltzmann machines". In: ICML. 2010.
[7] R. Raghavendra, A. D. Bue, and M. Cristani. Unusual crowd activity dataset of University of Minnesota. 2006.
[8] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting". In: The Journal of Machine Learning Research 15.1 (2014), pp. 1929–1958.
[9] Waqas Sultani, Chen Chen, and Mubarak Shah. "Real-world anomaly detection in surveillance videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6479–6488.
[10] Yu Tian et al. "Weakly-supervised video anomaly detection with robust temporal feature magnitude learning". In: arXiv preprint arXiv:2101.10030 (2021).
[11] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4489–4497.
[12] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. "Non-local neural networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 7794–7803.
[13] Fisher Yu and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions". In: arXiv preprint arXiv:1511.07122 (2015).