COMPUTER SCIENCE PROJECT
VIDEO-BASED PARKING SPACE
DETECTION
Major: Computer Science
THESIS COMMITTEE: CLC KHMT 2 SUPERVISOR(s): NGUYỄN THANH BÌNH
STUDENT: NGUYỄN TẤN TÀI (1852725)
HO CHI MINH CITY, 2/2023 (9/1/2023)
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY FACULTY OF COMPUTER SCIENCE AND ENGINEERING
FACULTY: Computer Science and Engineering _ GRADUATION THESIS ASSIGNMENT
DEPARTMENT: Information Systems _ Note: The student must attach this sheet to the first page of the thesis report
Student's full name:
Nguyễn Tấn Tài – 1852725
Major (specialization): Computer Science
1 Thesis title:
VIDEO-BASED PARKING SPACE DETECTION
2 Tasks (requirements on content and initial data):
- Study video image formats
- Study parking lot video datasets
- Study related research works and their advantages and disadvantages
- Study the features of parking spaces
- Propose a method for analysis and detection
- Implement the proposed method and test it on a standard dataset
- Compare with other methods
- Evaluate the algorithm
3 Thesis assignment date: 10/08/2022
4 Completion date: 15/12/2022
5 Supervisor's full name: Assoc. Prof. Dr. Nguyễn Thanh Bình
The content and requirements of the graduation thesis have been approved by the Department
Day …… month …… year 2022
(Signature and full name) (Signature and full name)
Assoc. Prof. Dr. Trần Minh Quang Nguyễn Thanh Bình
SECTION FOR THE FACULTY AND DEPARTMENT:
Reviewer (preliminary grading):
Declaration of Authenticity
I declare that this research is my own work, conducted under the supervision and guidance of Assoc. Prof. Nguyen Thanh Binh. The results of my research are legitimate and have not been published in any form prior to this. All materials used within this research were collected by myself from various sources and are appropriately listed in the references section.
In addition, within this research, I also used the results of several other authors and organizations. They have all been aptly referenced.
In any case of plagiarism, I stand by my actions and will be responsible for them. Ho Chi Minh City University of Technology is therefore not responsible for any copyright infringements conducted within my research.
Contents
1 Introduction
1.1 Problem Statement
1.2 Goals
1.3 Limitations
1.4 Thesis Structure
2 Theoretical Background
2.1 Theoretical Knowledge
2.1.1 Object Detection
2.1.2 Convolutional Neural Network (CNN)
2.1.3 CNN Layers
2.2 Related Work
2.2.1 CNN-based detection algorithms
2.2.2 Two-stage detection algorithms
2.2.3 One-stage detection algorithms
3 Proposed Methods
3.1 The Problem
3.2 Proposed Solution
3.2.1 YOLOv7
3.2.2 Model re-parameterization
3.2.3 Model Scaling
3.2.4 Model architecture
3.2.5 Trainable bag-of-freebies
3.3 Evaluation Methods
3.3.1 Intersection over Union (IoU)
3.3.2 True Positive, False Positive, False Negative, True Negative
3.3.3 Precision, Recall
3.3.4 F1 score
3.3.5 Precision - Recall curve
3.3.6 Average Precision
3.3.7 Mean Average Precision
4 Project Results
4.1 Hardware and Dataset
4.1.1 Google Colaboratory
4.1.2 Dataset selection and annotation
4.2 Research and Evaluation
4.2.1 Training parameters
4.2.2 Training metrics
4.2.3 Testing on images
4.2.4 Testing on sample video footage
5 Summary
5.1 Final Result
5.2 Advantages and Disadvantages
5.3 Future Plans
List of Figures
2.1 A road map of object detection with various milestones [2]
2.2 An example of a CNN structure [7]
2.3 An illustration of the convolution operation [7]
2.4 Convolution operation [8]
2.5 Max Pooling operation with 2x2 filters and stride 2 [7]
2.6 Non-linear operations plots visualization [10]
2.7 R-CNN visualization [13]
2.8 SPPNet visualization [13]
2.9 Fast R-CNN visualization [13]
2.10 Faster R-CNN visualization [13]
2.11 Feature Pyramid Network visualization [13]
2.12 YOLO visualization [13]
2.13 SSD visualization [13]
2.14 RetinaNet visualization [23]
3.1 Comparison of YOLOv7 with other real-time object detectors [27]
3.2 YOLOv7 architecture [27]
3.3 Extended efficient layer aggregation networks [27]
3.4 Model scaling for concatenation-based models [27]
3.5 RepConv being used in VGG [27]
3.6 Planned re-parameterized model [27]
3.7 Coarse for auxiliary and fine for lead head label assigner [27]
3.8 Intersection over Union [28]
4.1 Hardware specifications of Google Colaboratory
4.2 Various sample images from the PKLot dataset, annotated
4.3 Confusion matrix
4.4 Precision curve
4.5 Recall curve
4.6 F1 curve
4.7 Precision Recall (PR) curve
4.8 Mean average precision with 0.5 IoU value (left) and between 0.5 and 0.95 IoU value (right)
4.9 Test image 1
4.10 Test image 2
4.11 Test image 3
4.12 Video footage test 1
4.13 Video footage test 2
1 Introduction
1.1 Problem Statement
Finding an available parking spot is a problem faced by many car owners, especially in developing modern cities. Most of the time there are vacant spots, but drivers have no information about them. A free spot may be far away, or it may be hidden behind other cars or any other object large enough to block the view. In some cases, parking spaces are managed by people such as security guards, who might not have a full view of the next available parking space. Sometimes drivers have to look for a vacant space themselves by circling around the parking lot, and another driver may arrive and occupy the slot first. This generates many losses: time, fuel, and maybe temper.
In developing modern cities, urban planning does not keep up with rapid population growth. This implies that vehicles newly bought between two urban planning implementations might not be accommodated by the existing parking facilities. This leads to poor use of the space by drivers and congestion in the parking lot, especially at peak hours, as drivers are stuck not knowing where to go next. A study by INRIX found that the average American driver spends 17 hours a year looking for a parking spot. That search costs each driver around $345 in wasted time, gas, and emissions. In larger cities, drivers spend even more time looking for spots [1].
1.2 Goals
Because finding a vacant parking slot is difficult in these ways, a system that detects vacant spaces is desirable to route drivers efficiently to suitable empty spots.
In order to develop such a system, this project’s objectives are the following:
• To review theoretical knowledge and related works regarding systems that detect parking space occupancy, and to apply them to our understanding of the problem and of methods to resolve it.
• To design a model which can efficiently detect the occupancy status of the parking spaces in an image and a video using our proposed method.
1.3 Limitations
In the vision-based approach using machine learning models, the model makes predictions on images and video based on the information learned from the training data, while the test footage is captured from the camera's point of view at testing time. Unless a few images of the testing environment are included in the training set, the model may not achieve the same detection results as it does under conditions seen during training. Furthermore, the precision values may change according to parking space conditions such as lighting, weather and parking space arrangement.
1.4 Thesis Structure
This paper is organized as follows:
• Chapter 1: Introduction - A brief introduction about the objectives of the thesis
• Chapter 2: Theoretical Background - An introduction to the theoretical background and related works that serve as foundation knowledge for the project
• Chapter 3: Proposed Methods - The proposed methods for solving the project's problem and the methods used to evaluate them
• Chapter 4: Project Results - The execution of the selected solution, its results and their evaluation
• Chapter 5: Summary - A summary of the final results and future plans
2 Theoretical Background
2.1 Theoretical Knowledge
2.1.1 Object Detection
“What objects are where?”
Different strategies have been proposed to solve the problem of object detection throughout the years. It is widely accepted that, over the past two decades, the progress of object detection has generally gone through two historical periods: the “traditional object detection period (before 2014)” and the “deep learning based detection period (after 2014)”, as shown in Figure 2.1 [2].
Figure 2.1: A road map of object detection with various milestones [2]
In 2012, Convolutional Neural Networks (CNNs) saw a resurgence [3]. Thanks to their ability to learn robust, high-level feature representations of an image, modern object detection algorithms adopted them and started to evolve at an incredible rate, and network architectures such as VGGNet [4], GoogLeNet [5] and Deep Residual Learning (ResNet) [6] were introduced over the following years.
2.1.2 Convolutional Neural Network (CNN)
A Convolutional Neural Network (CNN) is a subclass of artificial neural networks that specializes in processing data with a grid-like topology, such as an image. A digital image is a binary representation of visual data: a series of pixels arranged in a grid, each with its own pixel values that denote how bright and what color that pixel should be.
The human brain processes a huge amount of information the second we see an image. Each neuron works in its own receptive field and is connected to other neurons in such a way that together they cover the entire visual field. Just as each neuron in biological vision systems responds to stimuli only in a restricted region of the visual field called the receptive field, each neuron in a CNN processes data only in its receptive field as well. The layers are arranged so that they detect simpler patterns first (lines, curves, etc.) and more complex patterns (faces, objects, etc.) further along.
2.1.3 CNN Layers
A typical CNN structure consists of several building blocks, or layers: an input layer, a convolutional layer, an activation layer, a pooling layer, a fully connected layer and finally, an output layer. Some types of CNN models might include other layers for different purposes.
Figure 2.2: An example of a CNN structure [7]
This multi-layered structure uses forward passes and error backpropagation to learn the target task. Training a CNN into a model is a supervised procedure that requires a collection of images and their labels. At the end of the training process, the most suitable weights have been computed and are used in the testing phase.
Of all the layers in the CNN structure, the three most important are convolution layers, pooling layers and fully connected layers, which are explained further below.
Figure 2.3: An illustration of the convolution operation [7]
During the forward pass, the kernel slides across the height and width of the image, producing the image representation of each receptive region. This produces a two-dimensional representation of the image known as an activation map, which gives the response of the kernel at each spatial position of the image. The step size with which the kernel slides is called the stride.
If we have an input of size W × W × D and D_out kernels with a spatial size of F, stride S and amount of padding P, then the size of the output volume can be determined by the following formula:
$W_{out} = \frac{W - F + 2P}{S} + 1$
This will yield an output volume of size $W_{out} \times W_{out} \times D_{out}$.
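The sliding-window computation and the standard output-size relation W_out = (W − F + 2P)/S + 1 can be sketched in plain Python. This is a minimal illustration for a single square channel and a single kernel, not the implementation used in this project; the function names `conv_output_size` and `conv2d_single_channel` are hypothetical.

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial output size for input width w, kernel size f, stride s, padding p."""
    span = w - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError("kernel, stride and padding do not tile the input evenly")
    return span // s + 1


def conv2d_single_channel(image, kernel, stride=1, padding=0):
    """Naive sliding-window convolution (cross-correlation) on a square 2-D list."""
    w, f = len(image), len(kernel)
    out = conv_output_size(w, f, stride, padding)
    # Zero-pad the image on all four sides.
    size = w + 2 * padding
    padded = [[0.0] * size for _ in range(size)]
    for r in range(w):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    # Each output cell is the dot product of the kernel with one receptive field.
    result = [[0.0] * out for _ in range(out)]
    for i in range(out):
        for j in range(out):
            result[i][j] = sum(
                kernel[a][b] * padded[i * stride + a][j * stride + b]
                for a in range(f)
                for b in range(f)
            )
    return result
```

For example, a 7 × 7 input with a 3 × 3 kernel, stride 2 and no padding gives an output width of 3, and a 3 × 3 kernel with padding 1 and stride 1 preserves the input width.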
Figure 2.4: Convolution operation [8]
Pooling layer
The pooling layer replaces the output of the network at certain locations with a summary statistic of the nearby outputs. This helps reduce the spatial size of the representation, which decreases the required amount of computation and weights. The pooling operation is applied to every slice of the representation individually.
There are several pooling functions, such as the average of the rectangular neighborhood, the L2 norm of the rectangular neighborhood, and a weighted average based on the distance from the central pixel. However, the most popular choice is max pooling, which reports the maximum output from the neighborhood.
Figure 2.5: Max Pooling operation with 2x2 filters and stride 2 [7]
If we have an activation map of size W × W × D, a pooling kernel of spatial size F, and stride S, then the size of the output volume can be determined by the following formula:
$W_{out} = \frac{W - F}{S} + 1$
This will yield an output volume of size $W_{out} \times W_{out} \times D$, since pooling acts on each depth slice separately and leaves the depth unchanged.
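As a small illustration of max pooling and of the output size (W − F)/S + 1, the following sketch pools one square depth slice stored as a list of lists. The names `pool_output_size` and `max_pool2d` are hypothetical, and the example values are made up for demonstration.

```python
def pool_output_size(w, f, s):
    """Spatial output size of pooling: (W - F) / S + 1."""
    return (w - f) // s + 1


def max_pool2d(activation, f=2, stride=2):
    """Max pooling over one square depth slice given as a 2-D list."""
    w = len(activation)
    out = pool_output_size(w, f, stride)
    return [
        [
            # Report the maximum value inside each F x F neighborhood.
            max(
                activation[i * stride + a][j * stride + b]
                for a in range(f)
                for b in range(f)
            )
            for j in range(out)
        ]
        for i in range(out)
    ]


slice_4x4 = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(slice_4x4, f=2, stride=2)  # 2x2 filters with stride 2
```

With 2 × 2 filters and stride 2, the 4 × 4 slice is reduced to 2 × 2, which discards 75% of the activations while keeping the strongest response in each neighborhood.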
In all cases, pooling provides some translation invariance, which means that an object would be recognizable regardless of where it appears in the frame.
Fully connected layer
Neurons in this layer have full connectivity with all neurons in the preceding and succeeding layers, as in a regular fully connected neural network (FCNN). This is why it can be computed as usual by a matrix multiplication followed by a bias offset.
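A fully connected layer can be sketched as a matrix-vector product, assuming the usual form y = Wx + b with one weight row and one bias term per output neuron. The function name `fully_connected` and the numbers below are illustrative only.

```python
def fully_connected(x, weights, bias):
    """Compute y = W x + b, with `weights` holding one row per output neuron."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


# Three input features mapped to two output neurons.
y = fully_connected(
    x=[1.0, 2.0, 3.0],
    weights=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    bias=[0.5, -1.0],
)
```

Every output neuron reads every input value, which is what "full connectivity" means in practice.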
• Sigmoid - the Sigmoid non-linearity has the mathematical form $\sigma(\kappa) = 1 / (1 + e^{-\kappa})$; it takes a real-valued number and “squashes” it into the range between 0 and 1.
• Tanh - similar to Sigmoid, but Tanh squashes a real-valued number into the range between -1 and 1 instead.
• Rectified Linear Unit (ReLU) - has become very popular in the last few years. It computes the function $f(\kappa) = \max(0, \kappa)$. In other words, the activation is simply thresholded at zero.
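The three non-linearities listed above can be written directly from their definitions. This is a minimal scalar sketch using only the Python standard library; in practice they are applied element-wise to whole activation maps.

```python
import math


def sigmoid(k):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-k))


def tanh(k):
    """Squash a real number into the range (-1, 1)."""
    return math.tanh(k)


def relu(k):
    """Threshold the activation at zero."""
    return max(0.0, k)
```

Note that this simple `sigmoid` can overflow for very large negative inputs; numerical libraries typically use a more careful formulation.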
Figure 2.6: Non-linear operations plots visualization [10]
2.2 Related Work
2.2.1 CNN-based detection algorithms
In the deep learning era (Figure 2.1), object detection algorithms can be grouped into two types: “two-stage detection” and “one-stage detection”, where the former frames detection as a “coarse-to-fine” process while the latter frames it as “complete in one step”.
2.2.2 Two-stage detection algorithms
R-CNN
R-CNN [11] is simple to understand: it starts with the extraction of a set of object proposals by selective search [12]. Each proposal is then rescaled to a fixed-size image and fed into a pre-trained CNN model to extract features. Finally, linear support vector machine (SVM) classifiers are used to predict the presence of an object within each region and to recognize object categories.
Figure 2.7: R-CNN visualization [13]
Although R-CNN made great progress, its drawbacks are obvious: the redundant feature computations on overlapping proposals lead to extremely slow detection speed. Also, selective search is a fixed algorithm, so no learning happens at that stage, which can lead to the generation of bad candidate region proposals. Later in the same year, SPPNet [14] was proposed and overcame this problem.
SPPNet
In 2014, Spatial Pyramid Pooling Networks (SPPNet) [14] were proposed. Conventionally, at the transition between the convolution layers and the fully connected layers, there is a single pooling layer or even no pooling layer. SPPNet instead uses multiple pooling layers with different scales. In addition, previous CNN models require a fixed-size input. The Spatial Pyramid Pooling (SPP) layer in SPPNet enables a CNN to generate a fixed-length representation regardless of the size of the image or region of interest, without rescaling it.
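The fixed-length property of the SPP layer can be illustrated with a short sketch. It pools a single-channel feature map of arbitrary size into 1 × 1, 2 × 2 and 4 × 4 grids of bins and concatenates the maxima, so the output always has 1 + 4 + 16 = 21 values. The pyramid levels, the bin-boundary rule and the name `spp_single_channel` are assumptions made for illustration, not details taken from [14].

```python
import math


def spp_single_channel(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map into n x n bins per level and concatenate."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin boundaries scale with the map size, so any H x W input works.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(
                    max(
                        feature_map[r][c]
                        for r in range(r0, r1)
                        for c in range(c0, c1)
                    )
                )
    return pooled


small = [[r * 7 + c for c in range(7)] for r in range(5)]    # 5 x 7 map
large = [[r * 9 + c for c in range(9)] for r in range(13)]   # 13 x 9 map
```

Both `small` and `large` produce vectors of the same length, which is what allows the following fully connected layers to accept inputs of any size.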
Figure 2.8: SPPNet visualization [13]
Figure 2.8 illustrates the process: the input image passes through the convolutional network of SPPNet only once. Selective search is used to generate region proposals, just as in R-CNN. At the last convolution layer, the feature maps bounded by each region proposal go into the SPP layer and then the FC layer.
Although SPPNet effectively improved the detection speed, there are still some drawbacks: first, the training is still multi-stage; second, SPPNet only fine-tunes its fully connected layers while simply ignoring all previous layers. In the following year, Fast R-CNN [15] was proposed and solved these problems.
Fast R-CNN
In 2015, R. Girshick proposed the Fast R-CNN detector [15], which is a further improvement of R-CNN and SPPNet. Compared to an R-CNN model, a Fast R-CNN model uses the entire image as the CNN input for feature extraction, rather than each proposed region. Selective search is applied to the image; suppose it generates n proposed regions, whose different shapes indicate regions of interest (RoIs) of different shapes. Fast R-CNN introduces RoI pooling, which takes the CNN output and the RoIs as input and outputs a concatenation of the features extracted from each proposed region, which is then fed into a fully connected layer. During category prediction, the shape of the fully connected layer output is transformed to n × q and softmax regression is used (q is the number of categories and n is the number of proposed regions). During bounding box prediction, the shape of the fully connected layer output is transformed to n × 4. This means that we predict the category and bounding box for each proposed region.
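RoI pooling can be sketched as max pooling over a cropped window of the shared feature map, with the window divided into a fixed grid of bins. The sketch below works on a single channel; the RoI format `(top, left, bottom, right)` with exclusive end coordinates, the 2 × 2 output size and the name `roi_max_pool` are assumptions for illustration and do not reproduce the exact quantization used in [15].

```python
import math


def roi_max_pool(feature_map, roi, out_size=2):
    """Pool one RoI of a 2-D feature map into an out_size x out_size grid."""
    top, left, bottom, right = roi  # bottom and right are exclusive
    h, w = bottom - top, right - left
    pooled = []
    for i in range(out_size):
        r0 = top + (i * h) // out_size
        r1 = top + math.ceil((i + 1) * h / out_size)
        row = []
        for j in range(out_size):
            c0 = left + (j * w) // out_size
            c1 = left + math.ceil((j + 1) * w / out_size)
            row.append(
                max(
                    feature_map[r][c]
                    for r in range(r0, r1)
                    for c in range(c0, c1)
                )
            )
        pooled.append(row)
    return pooled


# One shared 6 x 6 feature map and two RoIs of different shapes.
fmap = [[r * 6 + c for c in range(6)] for r in range(6)]
rois = [(0, 0, 4, 4), (1, 2, 6, 5)]
features = [roi_max_pool(fmap, roi) for roi in rois]
```

The feature map is computed once, and each of the n RoIs yields a feature block of identical shape, which is why the later fully connected outputs can be arranged as n × q and n × 4.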
Figure 2.9: Fast R-CNN visualization [13]
The reason “Fast R-CNN” is faster than R-CNN is that we do not have to feed every region proposal to the convolutional neural network separately. Instead, the convolution operation is done only once per image and a feature map is generated from it.
Although Fast R-CNN successfully integrates the advantages of R-CNN and SPPNet, its detection speed is still limited by proposal detection. Naturally, a question arises: “Can we generate object proposals with a CNN model?” Later, Faster R-CNN [16] answered this question.
Faster R-CNN
In 2015, S. Ren et al. proposed the Faster R-CNN detector [16] shortly after Fast R-CNN. It is the first end-to-end, and the first near-realtime, deep learning object detector. All of the above algorithms (R-CNN [11], SPPNet [14] and Fast R-CNN [15]) use selective search to find the region proposals. Selective search is a slow and time-consuming process that affects the performance of the network. Faster R-CNN eliminates the selective search algorithm and lets the network learn the region proposals.
Figure 2.10: Faster R-CNN visualization [13]
Similar to Fast R-CNN, the image is provided as input to a convolutional network, which produces a convolutional feature map. Instead of running selective search to identify the region proposals, a separate network predicts them from the feature map. The predicted region proposals are then reshaped using an RoI pooling layer, whose output is used to classify the image within the proposed region and to predict the offset values for the bounding boxes.
As a part of the Faster R-CNN model, the region proposal network is trained together with the rest of the model. In addition, the Faster R-CNN objective function includes the category and bounding box predictions in object detection, as well as the category and bounding box predictions for the anchor boxes in the region proposal network. Finally, the region proposal network can learn how to generate high-quality proposed regions, which reduces the number of proposed regions while maintaining the precision of object detection.
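The anchor boxes mentioned above are reference boxes of several sizes and aspect ratios placed at each feature-map location, which the region proposal network then scores and refines. A minimal sketch of generating the anchors for one location follows; the scales, the ratios (taken as height divided by width) and the name `anchors_at` are illustrative assumptions rather than settings used in this project.

```python
import math


def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred on (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:
            # Keep the area close to s * s while changing the aspect ratio.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


anchors = anchors_at(0.0, 0.0)  # 3 scales x 3 ratios = 9 anchors
```

Repeating this at every location of the feature map gives the full anchor set over which the category and bounding box predictions of the region proposal network are defined.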
Although Faster R-CNN breaks through the speed bottleneck of Fast R-CNN, there is still computation redundancy at the subsequent detection stage. Later, a variety of improvements were proposed, including R-FCN [17] and Light-Head R-CNN [18].
Feature Pyramid Networks
In 2017, T.-Y. Lin et al. proposed Feature Pyramid Networks (FPN) [19]. If we dig into Faster R-CNN, we see that it is mostly unable to catch small objects in the image. To address this, a simple image pyramid can be used to scale the image to different sizes and send each to the network. Once detections are obtained at each scale, all the predictions can be combined using different methods.
Figure 2.11: Feature Pyramid Network visualization [13]
Before FPN, most deep learning based detectors ran detection only on a network’s top layer. Although the features in deeper layers of a CNN are beneficial for category recognition, they are not conducive to localizing objects. FPN develops a top-down architecture with lateral connections for building high-level semantics at all scales. Since a CNN naturally forms a feature pyramid through its forward propagation, FPN shows great advances in detecting objects with a wide variety of scales.
FPN has now become a basic building block of many of the latest detectors.
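The top-down pathway with lateral connections can be sketched on single-channel maps: the coarsest map is upsampled by a factor of two and added to the next finer map, and this is repeated down the pyramid. This simplified sketch omits the learned convolutions that a real FPN applies on the lateral and merged maps; the names `upsample2x` and `top_down_merge` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor upsampling of a 2-D list by a factor of two."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down_merge(bottom_up):
    """Merge maps ordered from finest to coarsest, each half the previous size."""
    merged = [bottom_up[-1]]  # start from the coarsest, most semantic map
    for lateral in reversed(bottom_up[:-1]):
        up = upsample2x(merged[0])
        merged.insert(
            0, [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, up)]
        )
    return merged


c3 = [[1] * 4 for _ in range(4)]    # fine, 4 x 4
c4 = [[10] * 2 for _ in range(2)]   # 2 x 2
c5 = [[100]]                        # coarse, 1 x 1
p3, p4, p5 = top_down_merge([c3, c4, c5])
```

After merging, even the finest level `p3` carries a contribution from the coarsest level `c5`, which is the sense in which high-level semantics become available at all scales.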