UNIVERSITY OF ENGINEERING AND TECHNOLOGY
FACULTY OF INFORMATION AND TECHNOLOGY
Pham Xuan Thanh
VEHICLE TRACKING AND CLASSIFICATION
ON TRAFFIC SURVEILLANCE VIDEOS
FINAL REPORT ADVANCED ARTIFICIAL INTELLIGENCE
HA NOI - 2022
Abstract: This report tackles the challenge of building a smart traffic surveillance system, particularly in Viet Nam, by solving the problem of vehicle tracking using surveillance videos. Inspired by popular and efficient computer vision techniques for multiple object tracking, this report proposes a pipeline that consists of two main parts: vehicle tracking and vehicle classification. The first component is responsible for estimating the location of multiple objects in every frame. The output of this component then acts as input for the second component, which is responsible for categorizing each type of vehicle. Finally, the proposed pipeline is combined with a counting algorithm to solve the problem of vehicle counting. This report also explains in detail how to construct a private dataset that consists of frames extracted from surveillance cameras by using a labeling tool called UET AILAB Annotation Tool. Empirical results are measured using the Multiple Object Tracking Accuracy (MOTA) metric for the object tracking task and the accuracy score for the object classification task, which reached 58% and 73% on average, respectively.
Keywords: deep learning, multiple objects tracking, object classification, vehicle counting.
Table of Contents
ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Problem statement
1.3 Existing problems
1.4 Report overview
CHAPTER 2 RELATED WORK
2.1 Deep learning and Convolutional Neural Networks
2.1.1 Development of neural networks and deep learning
2.1.2 Applications of deep learning in MOT problem
2.2 Multiple objects tracking algorithms
2.3 FairMOT - A baseline for one-shot multiple objects tracking
2.3.1 Limitation of existing MOT methods
2.3.2 Network architecture
2.4 Object classification with EfficientNet
2.5 Chapter summary
CHAPTER 3 THE METHOD
3.1 Data preparation
3.1.1 Data collection
3.1.2 Data annotation tool
3.2 Overall pipeline for vehicle tracking and classification
3.2.1 Vehicle tracking model
3.2.2 Vehicle classification model
3.3 Vehicle counting algorithm
3.4 Chapter summary
CHAPTER 4 RESULTS AND DISCUSSIONS
4.1 Evaluation metrics
4.1.1 MOTA metric
4.1.2 IDF1 metric
4.1.3 Accuracy metric
4.2 Model construction
4.2.1 Data preparation
4.2.2 Training process
4.3 Empirical results
4.3.1 Vehicle tracking model
4.3.2 Vehicle classification model
4.3.3 Vehicle counting algorithm
4.4 Chapter summary
5.1 Solved problems
5.2 Further improvements
List of Figures
1.1 Overview of vehicle tracking and classification process
2.1 Multi-layer perceptron architecture
2.2 LeNet-5 architecture
2.3 Comparison between machine learning and deep learning
2.4 Comparison between different MOT methods (a) One anchor contains multiple identities (b) Multiple anchors response for one identity (c) One point for one identity
2.5 FairMOT architecture
2.6 Model Scaling
3.1 Example of dataset
3.2 12 possible directions
3.3 Vatic annotation tool user interface
3.4 UET AILAB Annotation Tool flow
3.5 Overall pipeline
3.6 FairMOT Encoder-Decoder Network
3.7 HRNet body architecture
3.8 HRNetv2 head for object detection
3.9 Swish activation function
3.10 SE-ResNet module
3.11 Vehicle classifier architecture
3.12 Example of counting algorithm
4.1 Confusion matrix
4.2 Ground truth extracting
4.3 Data folder structure
4.4 Example of classification dataset
4.5 Tracking result on a test frame
4.6 Example of the misidentified object
4.7 Example of inaccurate threshold value
4.8 Example of correct classification results (a) Object is classified as "Bus" (b) Object is classified as "Truck"
4.9 Example of noises in training data (a) Object belongs to "Motorbike" class (b) Object belongs to "Car" class (c) Object belongs to "Truck" class
List of Tables
3.1 Classification model detail hyper-parameters
4.2 Tracking model detail hyper-parameters
4.3 Classification model detail hyper-parameters
4.4 Result of vehicle tracking model on different metrics
4.5 Accuracy of different baselines
4.6 Percentage of correct prediction produced by the vehicle counting algorithm
Abbreviations    Full name
AI     Artificial Intelligence
CNN    Convolutional Neural Network
DBN    Deep Belief Network
GPS    Global Positioning System
MLP    Multi Layer Perceptron
MOT    Multiple Object Tracking
ROI    Region Of Interest
SOT    Single Object Tracking
CHAPTER 1 INTRODUCTION
1.1 Motivation
According to data published by the Vietnam General Statistics Office in 2019 [1], the overall population of Viet Nam is over 96 million, making it the third most populous country in the South East Asia region (after Indonesia and the Philippines) and the 15th most populous worldwide. Additionally, Viet Nam has an extremely high population density of 290 people per square kilometer, which also ranks third in the South East Asia region. These numbers reflect rising overpopulation and the need for infrastructure expansion in most developing countries in Asia. On the other hand, Vietnamese people mostly prefer personal vehicles, typically motorcycles or cars, over public transportation such as buses or express trains. From January 2020 to December 2020, over 4 million motorcycles and over 3 million cars were registered and in active circulation according to the Vietnam Register Administration [2]. These two major factors lead to a growing traffic congestion problem, because there are far too many people at the same location at the same time. One solution to this problem is to build an appropriate traffic management system that is not only highly accurate but also efficient. This idea inspired this report, which builds an end-to-end pipeline using deep learning methods, particularly Multiple Object Tracking, that reduces the human factor while maintaining reliability.
1.2 Problem statement
... robot navigation. Solving the object tracking problem can be simplified to finding a function that takes images/frames as input:
outputs = f(frames) (1.1)
where frames are extracted from videos. A tracking system's main objective is to determine information about each object in every frame, such as its location, direction, or size. By definition, object tracking is an algorithm that tracks the displacement of one or several particular objects using cameras to capture a scene. However, using a single view always brings the possibility of partial occlusion of the targeted object. This problem can be solved by using more cameras in the same scene; nevertheless, this requires translating the set of views to 3D measurements and correlating the cameras' positions. Such requirements can become expensive and computationally heavy for the algorithm. Moreover, the knowledge of each camera's position and parameters, or camera calibration, is vulnerable to physical changes like the weather and vandalism. There are mainly two problems in object tracking:
• Single Object Tracking (SOT)
• Multiple Object Tracking (MOT)
To solve the vehicle tracking and classification problem, this report proposes a pipeline divided into two sub-problems:
• Object tracking and detection: This is the first step in the vehicle tracking and classification problem. This task is responsible for estimating the location of objects and producing the corresponding bounding box values.
• Object classification: This is the last step in the vehicle tracking and classification problem. It takes the output of the previous step as input and categorizes each object.
Figure 1.1: Overview of vehicle tracking and classification process.
The tracking of an object can be as straightforward as detecting it in each frame and computing its displacement, if it is the only object present in the video sequence. However, most scenarios contain multiple objects, with the possibility of occlusion among them. To adapt to this, other tracking techniques have to be used. Methods like optical flow, Kalman filters, texture matching, and so on can estimate or predict the displacement of an object. When tracking multiple targets simultaneously, the main difficulty is the similar appearance of targets, compounded by occlusions and interactions between objects, while in SOT the appearance of the target is known. Therefore, applying SOT models to the MOT problem leads to poor results caused by target drift and numerous object identity switch errors. Simple multiple object tracking can be treated as a multi-variable estimation problem [8]: finding the optimal sequential states of all the objects, as given by the following formula:
\hat{S}_{1:t} = \arg\max_{S_{1:t}} P(S_{1:t} \mid O_{1:t})

where S_{1:t} = {S_1, S_2, ..., S_t} are the sequential states of all the objects from the first frame to the t-th frame and O_{1:t} = {O_1, O_2, ..., O_t} are the collected sequential observations of all objects from the first frame to the t-th frame.
1.3 Existing problems
Several problems arose while preparing this report:
1. Lack of data: Since this report is built on deep learning with convolutional neural networks, it requires a large amount of data. Object tracking in particular needs videos as the main input of every model. Vehicle datasets are extremely limited because of the restrictions on collecting such data. Additionally, the accuracy of a tracking system can be restricted in the context of Viet Nam, where more than 85% of vehicles are motorcycles and the population density is very high. To solve this problem, this report created a private dataset for experiments and evaluation by extracting videos from available surveillance cameras located in Viet Nam, hence the name "Vehicle tracking and classification on surveillance cameras". This report also used pre-trained models to reduce training cost and time.
2. Diversity of images: The next problem this report faced is the diversity of images. Input frames can be captured at various times of day, be affected by different environmental factors, come from multiple viewpoints, and differ in quality from camera to camera. For example, a tracker that identifies objects easily in the morning is not guaranteed to perform as well at other times of day, and vice versa. As a result, it is essential to build a generalized tracking system that performs well across these conditions.
3. Combining models: As mentioned before, the problem of vehicle tracking and classification is divided into two different tasks. Therefore, when combining these models, information can be passed incorrectly between them. By standardizing the input data, the report minimizes this inaccuracy.
1.4 Report overview
... to create this implementation.
In Chapter 3, the details of the proposed pipeline are presented. This chapter also discusses some of the major difficulties and solutions encountered while building this pipeline.
Finally, Chapter 4 shows all empirical results, which are obtained and evaluated on the private dataset extracted from surveillance cameras.
CHAPTER 2 RELATED WORK
This chapter gives a brief overview of some background theory and analyzes some existing methods used to create this report.
2.1 Deep learning and Convolutional Neural Networks
This section describes the development of deep learning, convolutional neural networks, and the backpropagation algorithm through some of the most iconic architectures. It then discusses current applications of deep learning.
2.1.1 Development of neural networks and deep learning
One of the first foundations of neural networks, the "perceptron learning algorithm" [3] developed by Frank Rosenblatt in 1957, has had an enormous impact on the modern deep learning era. Initially, it was a supervised learning algorithm for binary classification that converges if the data is linearly separable. Even though Marvin Minsky and Seymour Papert proved in 1969 that the perceptron algorithm could not "learn" the XOR operator [4], it was essential for later algorithms. In 1986, revisiting the perceptron idea, Geoffrey Hinton published a scientific paper [5] introducing the "multi-layer perceptron" (MLP) with a training procedure called "backpropagation", which overcame the limitation of the original algorithm and also popularized the term "neural nets" or "neural networks".
Figure 2.1: Multi-layer perceptron architecture.
With the rise of neural nets, Yann LeCun developed "convolutional neural nets" in 1998 [6], also known as ConvNet, CNN or LeNet, to solve the problem of reading hand-written digits. At the time, it was the best algorithm for this problem thanks to its ability to extract features from input images through two-dimensional filters. Furthermore, these filters are quite small, so the number of calculations is smaller and the computation faster than in a traditional MLP. Convolution is a mathematical term, here referring to an operation between two matrices. The convolutional layer has a fixed small matrix, also called a kernel or filter. As the kernel slides, or convolves, across the matrix representation of the input image, it multiplies the kernel values element-wise with the underlying image values and sums the products. Specially designed kernels can process images for everyday purposes such as blurring, sharpening, edge detection, and many others, quickly and efficiently.
Figure 2.2: LeNet-5 architecture.
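The sliding-kernel operation described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming stride 1, no padding, and no kernel flip (the cross-correlation form commonly used in CNN layers); the function name `conv2d_valid` and the toy image are hypothetical and not taken from LeNet.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of lists), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the kernel with the image patch, then sum.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
# Hypothetical 2x2 kernel that responds to left-to-right intensity increases.
kernel = [[-1, 1],
          [-1, 1]]
edges = conv2d_valid(image, kernel)
```

The response is non-zero only in the output column that straddles the edge, which is the behavior expected from an edge-detection kernel.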
Other models were expected to solve other image classification problems as successfully as LeNet; however, they suffered from both objective and subjective limitations at the time. ConvNet models require a large amount of input data for training, which was constrained by the quality of cameras and the available labeling resources. Even when a dataset met these requirements, training was still limited by the computational power of the hardware. On the other hand, loss functions in MLP models are not convex, which makes finding the global optimum more difficult. As the number of hidden layers increases, training also becomes less efficient because of the "vanishing gradient" problem. These drawbacks left neural nets largely forgotten for a long time until Hinton introduced "unsupervised pretraining" in 2006 through "deep belief nets" (DBN) [7]. A DBN learns to probabilistically reconstruct its inputs from a set of examples without supervision; its layers then act as feature detectors and can be further trained with supervision to perform a specific classification task. This idea partly solved the vanishing gradient problem by using pre-trained weight matrices in the first few hidden layers instead of relying on the loss function derivative for those layers. After this publication, neural networks with multiple hidden layers came to be known as "deep learning".
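To make the vanishing gradient problem concrete, the sketch below multiplies the local derivatives along a chain of scalar sigmoid layers. It is a simplified illustration under stated assumptions (one unit per layer, a fixed weight of 1.0, a hypothetical helper `input_gradient`), not a model of any network used in this report. Because the sigmoid derivative never exceeds 0.25, the gradient reaching the first layer shrinks at least geometrically with depth.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(depth, x=0.5, weight=1.0):
    """Derivative of the output w.r.t. the input for `depth` chained sigmoid units."""
    grad = 1.0
    activation = x
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: each layer contributes weight * sigmoid'(z),
        # where sigmoid'(z) = a * (1 - a) <= 0.25.
        grad *= weight * activation * (1.0 - activation)
    return grad


for depth in (1, 2, 5, 10):
    print(depth, input_gradient(depth))
```

The printed values fall rapidly as depth grows, which is why the earliest layers of a deep sigmoid network receive almost no training signal.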
Deep learning usually refers to a machine learning technique that teaches computers to do what comes naturally to humans: learn by example. It is a crucial technology behind various applications; for example, it helps autonomous cars recognize a stop sign or distinguish a pedestrian from a lamppost. It is also the key to voice control in smart devices like phones, tablets, TVs, or hands-free speakers. Deep learning has received a lot of attention lately, and for good reason. In a deep learning context, a computer model learns to perform classification tasks directly from images, text, or sound. It can achieve state-of-the-art accuracy, sometimes even exceeding human-level performance. These models are trained using a large set of labeled data and neural network architectures that contain many layers. Typically, a machine learning workflow starts with relevant features being manually extracted from images. The features are then used to create a model that categorizes the objects in the image. With a deep learning workflow, by contrast, relevant features are automatically extracted from images. Additionally, deep learning performs "end-to-end" learning, where a network is given raw data and a task to perform, and it learns how to do this automatically. Another key difference is that deep learning algorithms scale with data, whereas the performance of shallow learning plateaus as more data is added.
Figure 2.3: Comparison between machine learning and deep learning.
2.1.2 Applications of deep learning in MOT problem
In the modern era, neural networks play an important role in countless applications. Three well-known application areas of neural networks are:
• Computer vision: Neural networks were first known for handling object detection problems. Many systems have been built on them, such as autonomous cars, surveillance frameworks, or vehicle detection.
• Linguistics: Neural networks also play an essential role in machine translation systems. This application breaks down the language barrier between countries and improves human communication. People can easily express themselves and understand others by simply typing into translation apps that use a neural network.
• Robotics: Thanks to the development of neural networks, some jobs can nowadays be performed by robots to reduce risk and effort for humans. Some iconic examples are autonomous house cleaning robots, firefighting robots, or even surgical robots.
Numerous applications can also be found specifically in the context of object tracking. For example, systems such as vehicle monitoring, customer behavior analysis, and security surveillance cameras with integrated multiple object tracking can increase performance and reduce human effort. Traditionally, vehicle tracking systems used GPS to determine location. These systems are mostly user-based and require the physical installation of a box in the vehicle, which is time-consuming and contributes nothing to the monitoring system itself. Applying deep learning methods significantly improves this process thanks to their flexibility.
2.2 Multiple objects tracking algorithms
MOT algorithms can be roughly classified into two distinct groups: detection-free and detection-based tracking. The first group does not rely on an object detector to provide target detections, while the second one does. The biggest advantage of the first approach is its independence from the detector type and its performance. This allows the tracker to be applied generally to many kinds of objects such as people, animals, cars, cells, etc. In contrast, the "tracking-by-detection" or detection-based approach is mostly specialized in tracking one given type of object. What the second group lacks in generality, it makes up for in practicality in real-life applications. It is the most popular approach for two main reasons. First, new objects are easily discovered, and disappearing objects are terminated automatically. Second, object detection has improved enormously in recent years. Each MOT algorithm consists of two major components:
1. An observation model measures the similarity between tracked objects in past frames and detected targets in a new frame through appearance, motion, and interaction cues.
2. A dynamic model receives the similarity matrix from the observation model as input and studies the behavior of tracked objects over time (appearance or disappearance of certain entities, tracking over time of the others).
The observation model can be simplified to finding similarities using the appearance, motion, or interaction between objects. The appearance model describes an object's visual representation and computes the similarity between two observations at different times. Yet, it is not necessarily sufficient to discriminate between different observations. A typical case would be two pedestrians with similar clothing in different locations in consecutive frames. In contrast, the motion model describes how an object is moving: a pedestrian can be static (for example, standing at a traffic light), walking with constant speed in a given direction, walking around a corner, accelerating or decelerating. The model can then predict the possible positions of a pedestrian in future frames, helping to distinguish between similar appearances, but it does not consider the influence of other objects. The interaction model captures the influences between different objects. For instance, a pedestrian in a group would follow the group movement, or a single pedestrian would adapt his speed or trajectory to avoid collision with others. The main role of the dynamic model, on the other hand, is to find the 'optimal' sequence of states for each detected object, that is, its track, using either all frames (so-called 'offline' methods) or only the frames up to the last frame observed (so-called 'online' methods). Two approaches exist to determine this sequence: probabilistic inference, mainly used in online algorithms, and deterministic optimization, mainly used in offline algorithms. While the first approach estimates the most probable state (size, position, velocity, etc.) of an object using information from previous observations, the second approach tries to assign the optimal solution to all tracked objects.
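The two components above can be illustrated with a deliberately simplified online association step. In this sketch, which is an assumption-laden toy and not the method used in this report, the observation model is reduced to bounding-box overlap (IoU) between existing tracks and new detections, and the dynamic model is reduced to greedy matching on that similarity; real trackers typically add appearance features, motion prediction, and an optimal assignment solver. The names `iou`, `associate`, and `min_iou` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Greedily match tracks (id -> box) to detections (list of boxes).

    Returns (matches, unmatched), where matches maps track id -> detection
    index and unmatched lists detection indices that would start new tracks.
    """
    # Observation model: similarity score for every (track, detection) pair.
    pairs = sorted(
        ((iou(t_box, d_box), t_id, d_idx)
         for t_id, t_box in tracks.items()
         for d_idx, d_box in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    # Simplified dynamic model: accept the most similar pairs first.
    for score, t_id, d_idx in pairs:
        if score < min_iou:
            break
        if t_id in matches or d_idx in used:
            continue
        matches[t_id] = d_idx
        used.add(d_idx)
    unmatched = [i for i in range(len(detections)) if i not in used]
    return matches, unmatched


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
matches, unmatched = associate(tracks, detections)
```

Here each existing track is matched to the detection that overlaps it, and the third detection, which overlaps nothing, is left unmatched and would be treated as a newly appearing object.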
2.3 FairMOT - A baseline for one-shot multiple objects tracking
2.3.1 Limitation of existing MOT methods
To solve the multi-object tracking problem, some existing methods such as SORT [9] or POI [10] use two separate models: first, a detection model responsible for localizing objects with bounding boxes and, second, a model that extracts re-identification (re-ID) features for tracking objects based on those boxes. This approach can be treated as multi-task learning since it requires training two homogeneous architectures for two different tasks. Another solution is to use one-shot trackers, which detect objects and learn re-ID features using a single network, such as Track R-CNN [11] and JDE [12]. These architectures are mostly anchor-based frameworks, which are not well suited to learning re-ID features, and as a result, object IDs can be switched despite good detection results. For example, Track R-CNN operates in steps: it first determines object boxes and then pools re-ID features from these proposals. As a result, the quality of re-ID features depends heavily on the quality of the proposals, and the features are not learned fairly throughout the framework. Most anchor-based methods use ROI-Pool or ROI-Align to represent bounding boxes, which introduces the unfairness caused by anchors. To solve this problem, FairMOT only extracts re-ID features at the center of the object.
Figure 2.4: Comparison between different MOT methods. (a) One anchor contains multiple identities. (b) Multiple anchors respond to one identity. (c) One point for one identity.
Commonly, object detection requires abstract, high-level features, while re-ID focuses on low-level features to distinguish instances of the same class. Nevertheless, in most one-shot trackers these features are shared between the two tasks, leading to unfairness. FairMOT uses multi-layer feature aggregation to resolve this inconsistency, allowing each task to extract the features it needs. Multi-layer fusion also avoids a bias toward the object detection branch, which would otherwise produce low-quality re-ID features. The imbalance between feature dimensions can cause another unfairness. Most existing one-shot trackers use high-dimensional re-ID features, which damages object detection accuracy because of the competition between the two tasks. This also raises the risk of over-fitting when training data is small. To reduce the risk, FairMOT proposes low-dimensional re-ID features, which improve inference speed and balance the two tasks.
2.3.2 Network architecture
The network structure of FairMOT is simple. It consists of two homogeneous branches for object detection and re-ID feature extraction instead of the cascaded models of the previously mentioned methods. The detection branch is anchor-free and estimates object centers and sizes. The re-ID branch, on the other hand, is responsible for identifying the object centered at each pixel based on the extracted features. With this approach, FairMOT removes the unfairness between the two tasks and produces high-quality re-ID features, resulting in a good trade-off between the two tasks.
Figure 2.5: FairMOT architecture.
FairMOT operates on high-resolution feature maps of stride 4 instead of stride 32 as in previous methods. Tracking accuracy improves significantly thanks to the elimination of anchors and the use of high-resolution feature maps, which align re-ID features with object centers. Furthermore, the dimension of re-ID features is fixed at only 64, which improves tracking robustness and reduces computation time. For the backbone, FairMOT uses ResNet-34 combined with Deep Layer Aggregation (DLA) [13] to fuse multi-layer features; this variant is named DLA-34. Denoting the size of the input image as H_image x W_image, the output map has the shape H x W x C, where H = H_image / 4 and W = W_image / 4.
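The detection branch is built on CenterNet [13] and produces heatmaps, object center offsets and bounding box sizes through three parallel heads. The heatmap head estimates the locations of object centers, while the box offset and size heads localize objects more precisely and estimate their extent. Denoting each ground truth box in the image as b^i = (x_1^i, y_1^i, x_2^i, y_2^i), the object center (c_x^i, c_y^i) is computed as

c_x^i = \frac{x_1^i + x_2^i}{2}, \quad c_y^i = \frac{y_1^i + y_2^i}{2}

The location of each object is then divided by the stride of 4:

(\tilde{c}_x^i, \tilde{c}_y^i) = \left( \left\lfloor \frac{c_x^i}{4} \right\rfloor, \left\lfloor \frac{c_y^i}{4} \right\rfloor \right)

Following the FairMOT formulation, the heatmap response at location (x, y) is

M_{xy} = \sum_{i=1}^{N} \exp\left( -\frac{(x - \tilde{c}_x^i)^2 + (y - \tilde{c}_y^i)^2}{2\sigma_c^2} \right)

where N is the number of objects in the image and \sigma_c is the standard deviation. The heatmap loss is a pixel-wise logistic regression with focal loss:

L_{heat} = -\frac{1}{N} \sum_{xy} \begin{cases} (1 - \hat{M}_{xy})^{\alpha} \log(\hat{M}_{xy}) & \text{if } M_{xy} = 1 \\ (1 - M_{xy})^{\beta} (\hat{M}_{xy})^{\alpha} \log(1 - \hat{M}_{xy}) & \text{otherwise} \end{cases}

where \hat{M} is the estimated heatmap and \alpha, \beta are the pre-determined parameters in focal loss. Denoting the outputs of the size and offset heads as \hat{S} and \hat{O}, for each ground truth box the size s^i = (x_2^i - x_1^i, y_2^i - y_1^i) and the offset o^i = (c_x^i / 4, c_y^i / 4) - (\tilde{c}_x^i, \tilde{c}_y^i) are computed, and the box loss L_{box} is an l1 loss between these targets and the values predicted at the object center. The re-ID branch is trained with its own identity loss L_{identity}, and the two branches are combined as

L_{total} = \frac{1}{2} \left( \frac{1}{e^{w_1}} L_{detection} + \frac{1}{e^{w_2}} L_{identity} + w_1 + w_2 \right)

where w_1, w_2 are learnable parameters used for balancing between the two tasks and L_{detection} = L_{heat} + L_{box}.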
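The stride-4 geometry described in this section can be checked with a short sketch. It assumes boxes are given as (x1, y1, x2, y2) in input-image pixels and uses a 720p frame purely as an example; the function names are hypothetical helpers, not part of the FairMOT code base.

```python
import math

STRIDE = 4  # FairMOT output maps are 4x smaller than the input image


def output_map_size(image_h, image_w, stride=STRIDE):
    """Height and width of the output feature map for a given input size."""
    return image_h // stride, image_w // stride


def center_targets(box, stride=STRIDE):
    """Map a ground truth box to its center cell, sub-cell offset and size."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Integer cell on the stride-4 map where the heatmap peak is placed.
    cell = (math.floor(cx / stride), math.floor(cy / stride))
    # Fractional part lost by the floor, which the offset head would regress.
    offset = (cx / stride - cell[0], cy / stride - cell[1])
    # Box width and height, which the size head would regress.
    size = (x2 - x1, y2 - y1)
    return cell, offset, size


map_h, map_w = output_map_size(720, 1280)
cell, offset, size = center_targets((100, 50, 180, 130))
```

For this example box the center (140, 90) lands in cell (35, 22) of a 180 x 320 map, with a vertical offset of 0.5 that would be lost without the offset head.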
2.4 Object classification with EfficientNet
In May 2019, the paper "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" [14] was published by Mingxing Tan and Quoc V. Le, two engineers from the Google Brain team. Its core idea is to scale deep neural networks strategically, and it introduces a new family of neural nets called EfficientNets. As the name suggests, they are computationally very efficient while achieving a state-of-the-art result of 84.4% top-1 accuracy on the ImageNet dataset [15]. There are three scaling dimensions of a CNN: depth, width, and resolution. Depth means how deep the network is, which is equivalent to the number of layers in it, and it is the most popular scaling dimension of all. Width means how wide the network is; one measure of width is the number of channels in a convolutional layer. Resolution is simply the resolution of the image passed to the CNN.
In detail, depth can be scaled up or down by adding or removing layers, based on the intuition that a deeper network can capture richer and more intricate features and generalize well to new tasks. In practice, however, adding layers does not always improve performance, even though in theory performance should keep improving with depth. One explanation is the vanishing gradient problem that appears as networks grow deeper. Even with techniques that smooth training and keep gradients from vanishing, adding more layers does not help in every scenario. Width scaling is commonly used when the goal is to keep the model small and easy to train. Wider networks tend to capture more fine-grained features, but accuracy saturates quickly when the network is wide yet shallow. Intuitively, features in a high-resolution image are more fine-grained, so high-resolution images should work better. This is why complex tasks such as object detection use image resolutions of 300x300, 512x512, or 600x600. However, the gain does not scale linearly with resolution. Based on these observations, the paper introduces a scaling method named compound scaling, which suggests that instead of scaling only one of depth, width, and resolution, strategically scaling all three together delivers better results.
Figure 2.6: Model Scaling.
The compound scaling method uses a compound coefficient ϕ to scale width, depth, and resolution together according to the following formulas:
• Depth: d = α^ϕ, Width: w = β^ϕ, Resolution: r = γ^ϕ
• Subject to: α · β^2 · γ^2 ≈ 2
• α ≥ 1, β ≥ 1, γ ≥ 1
ϕ is a user-specified coefficient that takes real values and controls how many more resources are available for scaling: total FLOPS grow by approximately 2^ϕ. The EfficientNet-B0 architecture was not designed by hand but found by a neural architecture search. The authors developed this model using a multi-objective neural architecture search that optimizes both accuracy and floating-point operations. Taking B0 as the baseline model, the authors developed a full family of EfficientNets from B1 to B7, which achieved state-of-the-art accuracy on ImageNet while being far more efficient than their competitors.
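As a worked example of compound scaling, the sketch below evaluates the three multipliers for a few values of ϕ. It assumes the constants reported in the EfficientNet paper for the B0 baseline (α = 1.2, β = 1.1, γ = 1.15), which satisfy the paper's constraint α · β² · γ² ≈ 2; the function name `compound_scale` is hypothetical, and the released B1 to B7 models round and adjust these values, so the numbers are indicative only.

```python
# Constants reported in the EfficientNet paper for the B0 baseline (assumed here).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPS scale linearly with depth and quadratically with width and
    # resolution, so the cost grows by (alpha * beta^2 * gamma^2)^phi.
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPS x{f:.2f}")
```

With these constants each unit increase of ϕ roughly doubles the computational cost, which is what lets a single coefficient control the resource budget.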
ReLU works well, but it has one problem: it nullifies negative values, and thus its derivative is zero for all negative inputs. There are many known alternatives that tackle this problem, such as Leaky ReLU, ELU, SELU, etc. However, none of them has proven consistently better. The Google Brain team suggested a newer activation that tends to work better than ReLU for deeper networks, the Swish activation:
Swish (x) = x ∗ sigmoid(x) (2.7)
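Equation (2.7) is easy to compare against ReLU numerically. The following is a minimal sketch using only the standard library; `swish_grad` is a hypothetical helper that applies the product rule to x · sigmoid(x) and is included only to show that the gradient does not vanish for negative inputs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def swish(x):
    """Swish(x) = x * sigmoid(x), as in equation (2.7)."""
    return x * sigmoid(x)


def swish_grad(x):
    """Derivative of Swish: sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):+.4f}  "
          f"swish'={swish_grad(x):+.4f}")
```

For negative inputs ReLU outputs exactly zero, while Swish returns a small negative value with a non-zero gradient; for large positive inputs the two functions nearly coincide.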
2.5 Chapter summary
This chapter gave an overview of the theory and background behind deep learning. The ability of deep learning to learn from unstructured data is an enormous benefit for those interested in real-world applications. This chapter also analyzed the two models that inspired this report: FairMOT for multiple object tracking and EfficientNet for object classification. The authors of these two papers shared their insights into different problems, along with detailed experiments and specific empirical results, which makes their findings trustworthy. The next chapter focuses on solving the vehicle tracking and classification problem by applying these theories to propose a complete pipeline.
CHAPTER 3 THE METHOD
3.1 Data preparation
3.1.1 Data collection
Some of the available datasets are provided by well-known challenges such as the MOT benchmark [16] or the KITTI benchmark [17], which mostly focus on pedestrian tracking instead of vehicle tracking. To solve this problem, we at UET AILAB created our own dataset, which consists of videos extracted from surveillance cameras in Viet Nam. These cameras have one or two view directions and focus on intersections, where vehicle density is highest. Videos are recorded in 720p resolution, each one hour long, spanning multiple consecutive days.
Figure 3.1: Example of dataset.
The dataset comes with ground truth files that record the class, coordinates, tracking ID, and direction of each object, as well as the number of objects. The collected cameras mostly target intersections since these locations usually have a high vehicle density. There are twelve possible directions in total, given by the combinations of four lanes, defined as follows:
• Direction 1: Up - Down.
• Direction 2: Up - Right
• Direction 3: Up - Left
• Direction 4: Down - Up
• Direction 5: Down - Right
• Direction 6: Down - Left
• Direction 7: Right - Left
• Direction 8: Right - Up
• Direction 9: Right - Down
• Direction 10: Left - Right
• Direction 11: Left - Down
• Direction 12: Left - Up
Figure 3.2: 12 possible directions.
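The twelve directions listed above are exactly the ordered pairs of two different sides of the intersection. The sketch below encodes the numbering as a lookup table and checks it against that enumeration. Reading "Up - Down" as "enters from Up, leaves toward Down" is an assumption, and the names `DIRECTIONS` and `direction_id` are hypothetical rather than part of the dataset's tooling.

```python
from itertools import permutations

SIDES = ("Up", "Down", "Right", "Left")

# Direction numbering as listed in the text: id -> (first side, second side).
DIRECTIONS = {
    1: ("Up", "Down"),    2: ("Up", "Right"),    3: ("Up", "Left"),
    4: ("Down", "Up"),    5: ("Down", "Right"),  6: ("Down", "Left"),
    7: ("Right", "Left"), 8: ("Right", "Up"),    9: ("Right", "Down"),
    10: ("Left", "Right"), 11: ("Left", "Down"), 12: ("Left", "Up"),
}


def direction_id(first, second):
    """Look up the direction number for an ordered pair of sides."""
    for number, pair in DIRECTIONS.items():
        if pair == (first, second):
            return number
    raise ValueError(f"no direction defined for {first} - {second}")


# Four sides taken two at a time, order mattering: 4 * 3 = 12 directions.
all_pairs = set(permutations(SIDES, 2))
```

Pairs with the same side twice, such as "Up - Up", are absent from the table, which matches the count of twelve rather than sixteen.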
These collected videos are entirely raw and unprocessed. Since the recording period spans multiple days, the videos are also affected by the surrounding environment and the lighting conditions. As the example video in the figure shows, at night the images extracted from these cameras are heavily overexposed by street lights, and a single view angle does not fully cover some areas. One of the significant problems when handling multiple object tracking with surveillance cameras is object occlusion