UIT-Anomaly: A Modern Vietnamese Video Dataset for Anomaly Detection

Dung T.T. Vo
University of Information Technology, Vietnam National University
Ho Chi Minh City, Vietnam
18520641@gm.uit.edu.vn

Tung Minh Tran
University of Information Technology, Vietnam National University
Ho Chi Minh City, Vietnam
tungtm.ncs@grad.uit.edu.vn

Nguyen D. Vo
University of Information Technology, Vietnam National University
Ho Chi Minh City, Vietnam
nguyenvd@uit.edu.vn

Khang Nguyen
University of Information Technology, Vietnam National University
Ho Chi Minh City, Vietnam
khangnttm@uit.edu.vn
Abstract—Anomaly detection in videos is of utmost importance for numerous tasks in the field of computer vision. We introduce the UIT-Anomaly dataset, captured in Vietnam, with a total duration of 200 minutes. It contains 224 videos with six different types of anomalies. Moreover, we apply a method for weakly supervised video anomaly detection, called Robust Temporal Feature Magnitude learning (RTFM), which is based on feature magnitude learning to detect abnormal snippets. The applied method yields competitive results compared with other state-of-the-art algorithms on publicly available datasets such as ShanghaiTech and UCF-Crime.
Index Terms—Anomaly Detection, Weak Supervision, Multiple Instance Learning.
I. INTRODUCTION

Nowadays, remote anomaly detection has become more popular due to the increase in the number of surveillance cameras. However, these surveillance systems are still not timely and require manual labour. Therefore, it is necessary to leverage the power of computer vision to automatically detect anomalies in videos. The goal of this problem is to find a model that accurately identifies the start and end points of an anomalous event. The input and output of the problem are demonstrated in Figure 1.
Fig. 1. Demonstration of the input and output of the problem. Input: videos filmed in Vietnam. Output: the time window (from the starting frame to the ending frame) that contains the abnormal events.
In this paper, we apply a method for weakly supervised anomaly detection called RTFM, which learns from training videos annotated at the video level. Each video is represented as a bag of video snippets, and this anomaly detection approach is based on the temporal feature magnitude of the snippets in the video. Specifically, normal snippets are represented with low feature magnitude, whereas abnormal snippets are denoted by high feature magnitude. In this approach, the k snippets with the highest feature magnitude are selected from both normal and abnormal videos, which makes the probability of selecting truly abnormal snippets in anomalous videos higher than in the MIL method [9]; these k snippets play a vital role in training a snippet classifier.

One of the biggest challenges for anomaly detection in Vietnam is the lack of data. Benchmark datasets for this problem are often taken from movies and are rarely extracted from surveillance cameras. Hence, they cannot provide high realism or distinctive features such as the settings, individuals, or forms of violence found in Vietnam. Therefore, we built a novel dataset of normal and abnormal videos in Vietnam, called UIT-Anomaly. Our dataset consists of 224 videos with six types of unusual behavior common in Vietnam. All the videos in our dataset capture actual events in a variety of contexts, which makes UIT-Anomaly more diverse than other benchmark datasets, as presented in Section IV.
II. RELATED WORK
In this section, we present two aspects: anomaly detection and the multiple instance learning (MIL) method [9].
A. Anomaly Detection
There are three main approaches to solving the anomaly detection problem in videos: unsupervised anomaly detection, supervised anomaly detection, and weakly supervised anomaly detection.
1) Unsupervised anomaly detection: as mentioned earlier, one of the challenges of anomaly detection is the lack of data. Anomalies do not often occur in real life, which hinders collecting data in a variety of contexts. By contrast, normal samples are easy to collect and do not take much time. Therefore, only normal videos in the training set are used in this approach. This helps to save time and effort when building a dataset.
2) Supervised anomaly detection: with this approach, frame-level annotation is required for both the training and the test sets, which means distinguishing between abnormal and normal frames. This is the most expensive step in the process of building a dataset.
3) Weakly supervised anomaly detection: in this approach, the training set is annotated at the video level, but the model is still required to learn frame-level prediction because the testing set remains fully annotated as in the supervised approach. Therefore, compared with the supervised approach, the annotation cost in the weakly supervised approach is very low, and it achieves much better performance than the unsupervised approach. The weakly supervised approach is thus the best option for the anomaly detection problem.
B. MIL Method
Sultani et al. [9] proposed a method for anomaly detection based on multiple instance learning. In this method, before extracting features, the frame rate of each video was changed to 30 fps with a size of 240 × 320 pixels. Each video was represented as a bag of 32 different snippets. Each snippet was divided into sets of 16 frames to extract features at the FC6 layer of the C3D network, so the feature vector of each set of 16 frames was a 4096-D one. The method averaged the l2-normalized features of all the 16-frame sets in a snippet and used the result as the feature of the whole snippet, so the feature vector of each snippet was also 4096-D. In the snippet classification step, the features of all snippets of a video were fed into an FC neural network that takes the 4096-D vector as input and has 3 layers containing 512, 32, and 1 units, respectively. There was 60% dropout [8] between the layers. Moreover, the method used the ReLU [6] activation for the first FC layer and the Sigmoid activation for the last FC layer. In addition, the anomaly score of each snippet was considered as the anomaly score of the frames in it. Figure 2 illustrates the MIL method.
Fig. 2. Deep MIL Ranking method [9].
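To make the snippet-scoring pipeline above concrete, the following is a minimal PyTorch sketch of the MIL snippet classifier just described. The layer sizes, activations, and 60% dropout follow the text; the class name, the toy bag tensor, and the max-pooling usage line are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MILSnippetClassifier(nn.Module):
    """Maps a 4096-D C3D snippet feature to an anomaly score in [0, 1]."""
    def __init__(self, feat_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 512),
            nn.ReLU(),            # ReLU activation for the first FC layer
            nn.Dropout(0.6),      # 60% dropout between layers [8]
            nn.Linear(512, 32),
            nn.Dropout(0.6),
            nn.Linear(32, 1),
            nn.Sigmoid(),         # Sigmoid activation for the last FC layer
        )

    def forward(self, snippet_feats: torch.Tensor) -> torch.Tensor:
        # snippet_feats: (batch, 32, 4096), one bag of 32 snippets per video
        return self.net(snippet_feats).squeeze(-1)  # (batch, 32) anomaly scores

# Hypothetical usage: score a bag and let the top-scoring snippet represent
# the video, as in MIL ranking.
bag = torch.randn(1, 32, 4096)            # stand-in for C3D FC6 features
scores = MILSnippetClassifier()(bag)      # per-snippet anomaly scores
video_score = scores.max(dim=1).values    # highest-scoring snippet per video
```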
Although this method works effectively, there are still many limitations. In MIL, the snippet with the highest anomaly score is used to represent each video, so it is likely that this snippet is not actually an anomaly, because the abnormal snippets are overwhelmed by the normal ones. In the case of more than one outlier, the chance to learn more abnormal snippets in each video might be missed when using this method.
III. METHODOLOGY
Yu Tian et al. [10] proposed a method for anomaly detection called RTFM to overcome the MIL method's drawbacks. Similar to MIL, RTFM detects anomalous and normal snippets by learning from weakly labeled videos to identify each snippet as normal or anomalous. Each video is represented as a bag consisting of T snippets, whose features are extracted via I3D [2] or C3D [11]. After that, F denotes the D-dimensional features of the T snippets in the video.
Fig. 3. RTFM method [10].
The multi-scale temporal network (MTN) [10] incorporates two modules, a pyramid of dilated convolutions (PDC) [13] and a temporal self-attention (TSA) module [12], on the temporal dimension, capturing the multi-resolution local temporal dependencies and the global temporal dependencies between video snippets, as presented in Figure 4. The output of MTN is the set of temporal features transformed from F, denoted X.
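The PDC module can be pictured as parallel one-dimensional convolutions over the snippet (time) axis with growing dilation rates, each branch seeing a different temporal resolution. Below is a hedged sketch of that idea; the branch width, kernel size, and dilation rates (1, 2, 4) are illustrative assumptions, not the exact MTN configuration of [10].

```python
import torch
import torch.nn as nn

class DilatedTemporalPyramid(nn.Module):
    """Multi-resolution local temporal modeling over snippet features."""
    def __init__(self, in_dim: int = 2048, branch_dim: int = 512,
                 dilations=(1, 2, 4)):
        super().__init__()
        # One 1-D conv branch per dilation rate; padding keeps the length T.
        self.branches = nn.ModuleList([
            nn.Conv1d(in_dim, branch_dim, kernel_size=3, dilation=d, padding=d)
            for d in dilations
        ])

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, T, D) snippet features; Conv1d expects (batch, D, T)
        f = f.transpose(1, 2)
        multi_scale = [branch(f) for branch in self.branches]
        # Concatenate the scales along channels and restore (batch, T, C)
        return torch.cat(multi_scale, dim=1).transpose(1, 2)

x = DilatedTemporalPyramid()(torch.randn(2, 32, 2048))  # -> (2, 32, 1536)
```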
To classify a video or a snippet as normal or anomalous, the method uses the l2 norm to calculate feature magnitudes and then selects the k snippets with the highest feature magnitude. Assuming that a normal snippet has a smaller feature magnitude than an anomalous one, RTFM optimizes the average feature magnitude of the top-k snippets from each video. In the feature magnitude learning phase, the highest feature magnitudes of normal videos' snippets are minimized, whereas the highest feature magnitudes of anomalous videos' snippets are maximized. This helps to increase the ability to separate normal and anomalous videos. Finally, RTFM uses the k snippets with the highest feature magnitude to train a snippet classifier.
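A minimal sketch of this objective follows: take the l2 norm of each temporal feature, average the k largest magnitudes per video, and push the abnormal-video average above the normal-video average. The hinge form and the margin value are illustrative assumptions; the paper's exact loss may differ in detail.

```python
import torch

def topk_mean_magnitude(x: torch.Tensor, k: int) -> torch.Tensor:
    # x: (batch, T, D) temporal features X produced by MTN
    magnitudes = x.norm(p=2, dim=-1)            # (batch, T) per-snippet l2 norm
    topk = magnitudes.topk(k, dim=-1).values    # k largest magnitudes per video
    return topk.mean(dim=-1)                    # (batch,) top-k mean magnitude

def magnitude_separation_loss(x_normal: torch.Tensor,
                              x_abnormal: torch.Tensor,
                              k: int = 3, margin: float = 100.0) -> torch.Tensor:
    # Maximize the top-k magnitude of abnormal videos while minimizing it for
    # normal videos: penalize batches where the separation is below a margin.
    sep = topk_mean_magnitude(x_abnormal, k) - topk_mean_magnitude(x_normal, k)
    return torch.relu(margin - sep).mean()

loss = magnitude_separation_loss(torch.randn(16, 32, 2048),
                                 2.0 * torch.randn(16, 32, 2048))
```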
To sum up, the RTFM training process includes the optimization of three modules: (1) multi-scale temporal feature learning; (2) feature magnitude learning; and (3) snippet classifier training. RTFM's training process is shown in Figure 3.
IV. DATASET

Nowadays, there are publicly available datasets for anomaly detection problems such as UMN [7], Violent-Flows [3], and Avenue [5]. However, they still have some disadvantages. For example, their anomalous behavior is staged or extracted from movies, which leads to a lack of realism; in particular, they cannot express the context and features of Vietnam regarding the environment, culture, people, and types of violence. Therefore, we built a novel surveillance video dataset, called UIT-Anomaly.

Fig. 4. Multi-scale temporal network (MTN) [13].
A. Selecting Anomaly Categories
As far as we know, it is difficult to completely define anomalous behavior because it has many aspects and presentations in the real world, so we clearly describe the anomalous activities to minimize ambiguity when creating the ground truth. To mitigate the above issues, we consider the six following anomaly classes: Stealing, Traffic Accident, Fighting, Unsporting Behavior, Against, and Dog Thief. We are interested in these anomalies because they have distinct features in Vietnam. Additionally, some samples of our dataset are presented in Figure 5.
B. Video Collection
We collect videos from YouTube by using keywords like “street violence”, “street stealing”, “marital combat”, “dog thief”, and other words with a similar meaning for each type of anomaly. For normal behavior, we search for “security camera at school”, “CCTV at street”, “CCTV at home”, etc. Anomalous behavior in Vietnam is not often captured by CCTV, so we also collect videos captured by smartphones and car black boxes. However, videos captured by CCTV still account for the majority, and all of them are shot from real events.
C. Video Cleaning
As a rule, we prefer videos that have not been manually edited, so our annotator team checks each video to make sure of this. Because the number of videos satisfying this rule is not sufficient, we decide to keep both the original videos and those edited videos whose content is still intact. After that, we re-edit such videos by deleting borders, changing the video speed back to normal, etc., intending to make each video as close to the original as possible. Furthermore, we also remove videos with excessive modifications or videos with unclear anomalies.
D. Video Annotation
Our dataset is annotated based on the weakly supervised approach, so the training set only needs annotation at the video level. In addition, the testing set is also annotated at the frame level to evaluate the performance of methods in the testing phase, that is, to confirm the start and end frames of anomalous activities. The dataset was finally accomplished after intense efforts over several months.
E. Dataset Statistics
The UIT-Anomaly dataset includes a total of 224 muted videos captured at a frame rate of 30 fps with various resolutions. It has 104 normal and 120 anomalous videos. The total duration is more than 200 minutes, corresponding to 392,188 frames. We divide these videos into two subsets: the training set includes 90 abnormal and 90 normal videos, while the test set consists of the remaining 30 abnormal and 14 normal videos. Both the training and test sets contain the six classes of anomalies. The distributions of videos in terms of length and of the number of frames in the test videos are presented in Figures 7 and 8, respectively.
We compare the UIT-Anomaly dataset with the others in Table I. Our dataset has a size and length that overwhelm the rest of the datasets. Overall, abnormal activities in UIT-Anomaly are very different from the anomalous activities of other datasets. For example, the Against and Dog Thief classes are two anomalies that are very rare in other datasets. Regarding diversity, the number of anomaly types in our dataset is larger. In addition, we also collect videos in indoor and outdoor places such as streets, homes, and restaurants, whereas other datasets only focus on one specific space. Moreover, other datasets cannot represent the specific features of anomalous events in Vietnam.
V. EXPERIMENTS
A. Implementation Details
In our approach, each video is divided into 32 snippets. We conduct experiments on the UIT-Anomaly dataset with k = 3, 4, 5, 6, 7, 8, 9, 10, 11. Moreover, 2048-D features are extracted at the mix_5c layer of the pre-trained I3D network. Furthermore, the three FC layers in MTN have 512, 128, and 1 units, respectively, and each FC layer uses a ReLU activation function with 70% dropout between FC layers. For all experiments, we train the RTFM method using the Adam optimizer with a weight decay of 0.0005 and a batch size of 16. In addition, we set the learning rate to 0.001 for 15,000 epochs. Each mini-batch includes samples from 32 randomly selected normal and abnormal videos.
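The optimization setup above can be summarized in a short sketch. The hyperparameters (Adam, learning rate 0.001, weight decay 0.0005, batch size 16, FC sizes 512/128/1 with 70% dropout) follow the text; the stand-in model, the random features, and the top-3 score pooling are placeholders for the real RTFM network and the 32-video mini-batch sampler.

```python
import torch
import torch.nn as nn

# Stand-in snippet scorer with the FC sizes and dropout stated in the text.
model = nn.Sequential(
    nn.Linear(2048, 512), nn.ReLU(), nn.Dropout(0.7),
    nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.7),
    nn.Linear(128, 1), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)

for step in range(1):  # the text trains for 15,000 epochs; one step shown here
    feats = torch.randn(16, 32, 2048)            # placeholder for sampled I3D features
    labels = torch.randint(0, 2, (16,)).float()  # video-level (weak) labels
    snippet_scores = model(feats).squeeze(-1)    # (16, 32) per-snippet scores
    # Train on the mean score of the top-3 snippets per video, echoing the
    # top-k selection idea; the exact RTFM loss is richer than this BCE.
    video_scores = snippet_scores.topk(3, dim=1).values.mean(dim=1)
    loss = nn.functional.binary_cross_entropy(video_scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```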
Fig. 5. Some samples of the six anomalous activities in the UIT-Anomaly dataset.
TABLE I: Comparison between UIT-Anomaly and benchmark datasets.

Dataset | # Videos | Total length | Anomaly types | Scene | Vietnamese context
Avenue [5] | 37 | 30 min | Run, throw, new object | Campus avenue | No
UCSD Ped1 [4] | 70 | 5 min | Bikers, small carts, walking across walkways | Walkway | No
UCSD Ped2 [4] | 28 | 5 min | Bikers, small carts, walking across walkways | Walkway | No
Subway Entrance [1] | 1 | 1.5 hours | Wrong direction, no payment | Subway | No
Subway Exit [1] | 1 | 1.5 hours | Wrong direction, no payment | Subway | No
UIT-Anomaly (Ours) | 224 | 3.5 hours | Stealing, Traffic Accident, Fighting, Unsporting Behavior, Against, Dog Thief | Street, house, restaurant, grocery store, office, etc. | Yes
B. Performance and Evaluation
From the results in Table II, we see that the proposed framework, which uses the k snippets with the largest feature magnitude, detects anomalies effectively. It is noticeable that the ability to distinguish between abnormal and normal videos tends to go up as k increases, because expanding the scope of training makes the model learn better. However, if k is too high, the model will fail to detect snippets as anomalies because the abnormal samples are overwhelmed by the normal ones in both normal and abnormal videos.
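For reference, frame-level AUC on this kind of benchmark is typically computed by repeating each snippet's anomaly score over the frames it covers (as described in Section II-B) and comparing the resulting frame scores against the frame-level ground truth. The sketch below uses scikit-learn's roc_auc_score on illustrative toy data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

snippet_scores = np.array([0.10, 0.20, 0.90, 0.15])  # one score per snippet
frames_per_snippet = 16
frame_scores = np.repeat(snippet_scores, frames_per_snippet)

frame_labels = np.zeros_like(frame_scores)  # frame-level ground truth
frame_labels[32:48] = 1.0                   # anomaly window: the third snippet

print(f"frame-level AUC: {roc_auc_score(frame_labels, frame_scores):.4f}")
```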
Fig. 7. Distribution of videos according to length in both the training and testing sets.

Fig. 8. Distribution of video frames in the testing videos.
TABLE II: AUC performance on the UIT-Anomaly dataset (%).

k = 10 | 73.98
k = 11 | 72.12
Furthermore, we also compare the results in two cases: training from scratch on UIT-Anomaly and using the best pre-trained parameters from the ShanghaiTech and UCF-Crime datasets of [10], as reported in Table III. Although the parameters trained on ShanghaiTech and UCF-Crime achieve high accuracy on their original datasets (AUC = 97.21% and 84.30%, respectively [10]), they have the lowest performance on the UIT-Anomaly dataset. Therefore, building a dataset for anomaly detection problems in Vietnam is essential.
TABLE III: Comparison of training results with parameters trained on benchmark datasets, in terms of the AUC metric (%).

Trained parameters from ShanghaiTech | 42.84
Trained parameters from UCF-Crime | 49.84
Trained on UIT-Anomaly (k = 9) | 76.07
As shown in Figure 9, we visualize anomaly detection on the testing videos. The videos include well-recorded anomalous events such as Traffic_Accident_051 and Dog_Thief_033. Furthermore, the model almost correctly detects the start and end times of the unusual events. The experimental results on normal videos, such as Normal_088 and Normal_074, are also good.
However, failure cases are observed in the testing videos. For instance, in the Stealing_058 video, it is clear that the model does not detect any anomalous event, because the thief snatches the bag instantaneously and the CCTV installation position is too high to capture the situation. Similarly, the model does not detect the start and end times of the anomalous event occurring during a soccer match in Unsporting_Behavior_008. This happens because the constant and disordered movement of the players makes detection difficult for the model, which can lead to missed detections.
For videos with many anomalies occurring in a very short time, such as the Fighting_100 video, RTFM cannot detect the unusual events separately; instead, it treats them as a single anomalous event. This is one of the challenges of the UIT-Anomaly dataset: even a SOTA method like RTFM, which achieves high performance on other benchmark datasets, still has some drawbacks when tested on it.
VI. CONCLUSION
We introduce the UIT-Anomaly dataset, a novel dataset for video anomaly detection in Vietnam. Moreover, frame-level ground truth labels for anomalous events in the videos are provided to evaluate anomaly detection methods across a variety of approaches. Several state-of-the-art methods are thoroughly evaluated on this dataset, and the results show that there is still room for improvement. We hope that the proposed dataset will stimulate the development of new weakly supervised or unsupervised anomaly detection methods.
ACKNOWLEDGMENT

This work was supported by the Multimedia Processing Lab (MMLab) at the University of Information Technology, VNU-HCM. We would also like to show our gratitude to the UIT-Together research group for sharing their pearls of wisdom with us during this research.
Fig. 9. Qualitative results of the RTFM method on testing videos. The pink window shows the ground-truth anomalous region. In order from left to right and top to bottom: Traffic_Accident_051, Normal_088, Normal_074, Dog_Thief_033, Against_033, Stealing_051, Stealing_058, Unsporting_Behavior_008, and Fighting_100.
REFERENCES

[1] Amit Adam, Ehud Rivlin, Ilan Shimshoni, and Daviv Reinitz. "Robust real-time unusual event detection using multiple fixed-location monitors". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 30.3 (2008), pp. 555–560.
[2] Joao Carreira and Andrew Zisserman. "Quo vadis, action recognition? A new model and the Kinetics dataset". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299–6308.
[3] Tal Hassner, Yossi Itcher, and Orit Kliper-Gross. "Violent flows: Real-time detection of violent crowd behavior". In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2012, pp. 1–6.
[4] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. "Anomaly detection and localization in crowded scenes". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 36.1 (2013), pp. 18–32.
[5] Cewu Lu, Jianping Shi, and Jiaya Jia. "Abnormal event detection at 150 fps in Matlab". In: Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 2720–2727.
[6] Vinod Nair and Geoffrey E. Hinton. "Rectified linear units improve restricted Boltzmann machines". In: ICML. 2010.
[7] R. Raghavendra, A. D. Bue, and M. Cristani. Unusual crowd activity dataset of University of Minnesota. 2006.
[8] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting". In: The Journal of Machine Learning Research 15.1 (2014), pp. 1929–1958.
[9] Waqas Sultani, Chen Chen, and Mubarak Shah. "Real-world anomaly detection in surveillance videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6479–6488.
[10] Yu Tian et al. "Weakly-supervised video anomaly detection with robust temporal feature magnitude learning". In: arXiv preprint arXiv:2101.10030 (2021).
[11] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4489–4497.
[12] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. "Non-local neural networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 7794–7803.
[13] Fisher Yu and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions". In: arXiv preprint arXiv:1511.07122 (2015).