HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
GRADUATION THESIS
DESIGN AN EFFICIENT FPGA-BASED
ACCELERATOR FOR REAL-TIME PARKING
OCCUPANCY DETECTION
Major: COMPUTER ENGINEERING
THESIS COMMITTEE: COMPUTER ENGINEERING
SUPERVISOR(s): ASSOC PROF DR TRAN NGOC THINH
MR HUYNH PHUC NGHI
MEMBER SECRETARY: ASSOC PROF DR PHAM QUOC CUONG
-o0o-
STUDENT 1: NGUYEN VU THANH NGUYEN - 1652437
HO CHI MINH CITY, 01/2023
UNIVERSITY OF TECHNOLOGY
FACULTY: Computer Science and Engineering
GRADUATION THESIS ASSIGNMENT
DEPARTMENT: Computer Engineering
Note: The student must attach this sheet to the first page of the thesis report
FULL NAME: Nguyễn Vũ Thành Nguyễn Student ID: 1652437
MAJOR: Computer Engineering CLASS:
1. Thesis title:
Design an efficient FPGA-based accelerator for real-time parking occupancy detection
2. Tasks (requirements on content and initial data):
• Research on a BNN approach in image classification task with CNRPark dataset
• Implementation of an encoder for input images and a set of weight parameters for
parking solutions
• Build and evaluate a hardware accelerator run on Ultra96v2 SoC with a pre-encoded
32x32 image set
3. Thesis assignment date: 19/09/2022
4. Thesis completion date: 09/01/2023
1) Assoc. Prof. Dr. Trần Ngọc Thịnh
2) BEng Huỳnh Phúc Nghị
The content and requirements of the graduation thesis have been approved by the Department
Day, month, year: 2023
(Signature and full name) (Signature and full name)
FOR FACULTY AND DEPARTMENT USE:
Reviewer (preliminary grading):
I hereby declare that I worked on, and am the sole author of, this bachelor thesis and that I have not used any sources other than those listed in the bibliography and identified as references. Other than that, the work presented is entirely my own.
Nguyen Vu Thanh Nguyen
This thesis has reached its result thanks to my own continuous efforts and to the support and encouragement of my lecturers, friends and family. I would like to express my sincere gratitude to those who have helped me throughout the study, the research and the work on this thesis.
I would first like to express my sincere gratitude to my supervisors, Assoc Professor Tran Ngoc Thinh and BEng Huynh Phuc Nghi, who have consistently helped me by providing not only the necessary tools to complete the thesis but also a wealth of information and direction so that I could follow the best route. They never stopped inspiring me and gave me the chance to participate in truly fascinating work that builds on much of what I had learned during my time at university. They have always been kind, understanding teachers who encouraged me to make changes to the work and to this thesis as needed. Working with them and gaining expertise under their guidance was an honor for me.
In addition to thanking the supervisors, I would also like to acknowledge the councilors at the thesis defense for their wise criticism and helpful suggestions.
Finally, I would like to thank my parents for always providing a positive environment for my development and for supporting me when I face difficulties in both my academic and personal life.
Nguyen Vu Thanh Nguyen
This thesis proposes, studies and examines an approach to developing an edge-AI smart parking solution, including hardware and software components, by implementing the FracBNN+CNRPark model with hardware acceleration on the Ultra96-V2 board. This approach allows end-users to monitor and detect busy and free parking spaces automatically via security cameras. The image classification model runs entirely on the edge, on the Ultra96-V2 board, without the help of a server workstation.
Commitment
1.1 Purpose and Motivation
1.2 Scope and Objectives
1.2.1 Problem Statements
1.2.2 Objectives
1.3 Structure of Thesis
2 Background knowledge and Terminology
2.1 Software - Artificial Intelligence
2.1.1 The development of AI
2.1.2 Neural Network (NN)
2.1.3 Convolutional Neural Network (CNN)
2.1.4 Binary Neural Network (BNN)
2.2 Smart Parking concepts
2.2.1 Smart Parking
2.2.2 Edge AI
2.3 Hardware and constraints
2.3.1 FPGA and SoC
2.3.2 GPU
2.4 Tools and Frameworks
2.4.1 Pytorch
2.4.2 Vivado and Vivado HLS
2.4.3 PYNQ and BNN-PYNQ
2.5 Performance Criteria
2.5.1 Recall, Precision & F1-score
2.5.2 Average IoU & mean Average Accuracy (mAP)
2.5.3 Power consumption
2.5.3.1 FPS & Latency
3 Literature Review
3.1 Smart Parking Related works
3.1.1 International Solutions
3.1.1.1 Moscow Parking (Source: https://parking.mos.ru/)
3.1.1.2 SENSIT
3.1.1.3 SFPark
3.1.1.4 Cisco Smart+Connected City Parking
3.1.2 Solutions in Vietnam
3.1.2.1 My Parking
3.1.2.2 IParking
3.1.3 Smart Parking systems with Image Processing
3.2 Previous Group’s Thesis Result
3.3 Vien’s Thesis
3.3.0.1 License Plate Dataset
3.3.0.2 YOLOv3 Object Detection Model
3.3.0.3 Implementing Vien’s approach
3.3.0.4 Comparing YOLOv3 on Ultra96-V2 and JetsonNano
3.4 The Original BNN Model
3.5 Improved BNN Models
3.6 FracBNN (Dec 2020)
4 Methodology
4.1 The proposed solution - Previous thesis group
4.2 FracBNN+CNRPark
4.2.1 Model Architecture
4.2.2 Hardware Accelerator Architecture
4.3 Dataset
4.3.1 CNRPark+EXT
5 Implementation
5.1 FracBNN+CNRPark model on Pytorch
5.1.1 Training the model
5.1.1.1 Training dataset
5.1.1.2 Training routine
5.2 Hardware Acceleration on Ultra96-V2
5.2.1 Weights and Bias processing
5.2.2 Thermometer Encoding
5.2.3 Building model on Vivado HLS and Vivado
5.2.4 Inference on the Ultra96-V2
5.3 Evaluation
5.3.1 Training result
5.3.2 Hardware acceleration result
6 Conclusion
6.1 Summary
6.1.1 Comparing to Vien’s thesis
6.1.2 Trained and implemented FracBNN+CNRPark model on Pytorch, run on GPU
6.1.3 Implemented hardware acceleration via Vivado HLS on Ultra96v2
6.1.4 Achieve inference result using built thermometer encoder on Ultra96v2
6.2 Future Improvements
6.2.1 Accuracy degradation
6.2.2 Real-time application
2.1 Development timeline of AI, ML and DL [1]
2.2 A Neural Network with 4 layers
2.3 Composition of a hidden layer on the CNN
2.4 Five commonly used activation functions: (a) binary step function, (b) sigmoid function, (c) tanh function, (d) ReLU function, (e) leaky ReLU function. [1]
2.5 Structure of a CNN. The data is run through several convolutional and pooling layers learning features in the image. [1]
2.6 Convolutional filter size 3x3 sweeping through image size 4x4
2.7 Output (darkgreen) from a convolutional filter
2.8 Convolutional filter sweeping with padding size 1, stride of 2
2.9 Output (darkgreen) of max pooling (left) and average pooling (right)
2.10 Smart Parking Solution
2.11 Edge AI Workflow
2.12 An FPGA block diagram
2.13 Front view of Ultra96-V2
2.14 Block diagram of the Ultra96-V2
2.15 A PyTorch workflow
2.16 Calculation of IoU
2.17 mAP visualization
2.18 JetsonStats GUI
3.1 Vietnamese License Plate Dataset
3.2 YOLOv3 performance result
3.3 YOLOv3 box bounding technique
3.4 Darknet-53
3.5 GPU Implementation Workflow
3.6 Benchmark with threshold IoU 50%
3.7 Benchmark with threshold IoU 75%
3.8 A visualization of the sign layer and Straight-Through Estimator (STE). While the real values of the weights are processed by the sign function in the forward pass, the gradient of the binary weights is simply passed through to the real-valued weights. [2]
3.9 BNN Training Curve
3.10 Evolution of BNN Accuracy - Source: FracBNN introduction presentation slides (FPGA2021)
3.11 Main contributions of FracBNN - Source: FracBNN introduction presentation slides (FPGA2021)
3.12 Input images need to be first encoded
3.13 Results of binarizing the input layer using thermometer encoding on CIFAR-10 – ResNet-20 BNN has 0.27 million parameters and 40.9 million BMACs [3]
3.14 Improving BNN by computing an additional sparse binary convolution layer. [3]
4.1 Previous group’s proposed solution - Smart Parking System Architecture [4]
4.2 FracBNN+CNRPark model architecture, based on Resnet20 [5]
4.3 Basic blocks - green highlights are the difference to ReActNet [6] model
4.4 Edge solution architecture
4.5 FracBNN accelerator architecture [3]
4.6 CNRPark+EXT dataset sample
5.1 FracBNN Training workflow
5.2 Generating an FPGA accelerator from trained FracBNN
5.3 Thermometer encoder workflow
5.4 Vivado HLS Utilization Estimate
2.1 Confusion Matrix [7]
3.1 Energy Consumption when Inferencing CNV Model on Ultra96v2 and VGG Model on Jetson Nano [4]
3.2 Summary of Vien’s thesis result
3.3 Summary and Comparison between Ultra96v2 and Jetson Nano
3.4 A table of major details of the methods presented in this section
3.5 Comparison of accuracies on the ImageNet dataset from works presented in this section. Full precision network accuracies are included for comparison as well. [8]
5.1 Resources utilization on Ultra96v2
5.2 Accuracy comparison between models of input size 32x32
5.3 Inference results of prospective models on Ultra96v2
5.4 Power consumption on Ultra96v2 of different models
ASIC Application-Specific Integrated Circuit
BNN Binary Neural Network
BRAM Block Random Access Memory
CNN Convolutional Neural Network
CPU Central Processing Unit
FF Flip-flop
FPGA Field Programmable Gate Array
GPU Graphics Processing Unit
MPSoC Multiprocessor System on Chip
LUT Look Up Table
PL Programmable Logic
RGB Red Green Blue
1.1 Purpose and Motivation
Machine learning became more widely known at the beginning of the twenty-first century. This resulted from the need to process increasing amounts of data and the availability of less expensive ML-capable hardware, such as GPUs and RAM. Due to their inherent design, neural networks rely heavily on parallel computing to be effective, making a GPU with its many cores the ideal tool for the job. Because of this, GPUs have been the norm for machine learning applications for the last few years, but new hardware has recently been introduced. The Tensor Processing Unit, or TPU, a chip made specifically for machine learning, was unveiled by Google in 2016. As a way to obtain the efficiency of a dedicated chip without the restrictions of an ASIC, FPGAs have also shown promise over time. The development of machine learning has also benefited greatly from the current trend of cloud computing. More researchers can conduct their research cost-effectively thanks to the availability of the computing power of large GPU, TPU, or FPGA centers [1].
Each year, more and more applications for machine learning are released, signaling the field’s rapid expansion. Today, we use such applications on a daily basis, whether in our cars, phones, medical software, or almost any other hi-tech product. To be accurate and effective, machine learning requires a lot of data, and as machine learning becomes more popular, more data is being sent across various networks. We send enormous amounts of data back and forth, particularly for applications that collect data in the field, send it to a datacenter for processing, and then return the results. As a result, the system that collects the data needs to further develop its ability to process data at the network’s edge. For instance, for a video stream that counts the number of cars on a highway, this could significantly reduce the amount of data on the network: the counted number of cars can be sent to a datacenter whenever necessary rather than sending full-scale images 30 times per second. As it stands today, machine learning will permeate more and more aspects of our lives. There are countless uses for intelligent machines that can assist us in performing tasks without being micromanaged. An application should always perform processing as close to the edge as possible to avoid clogging our communication networks with data [1].
As Vietnam continues to develop economically, the country has seen a steady increase in the number of cars in traffic, driving the need for more and bigger parking lots to accommodate them. This has led to many new problems for both drivers and management, one of which is the increasing difficulty of finding an empty parking space. Smart Parking emerged as a viable solution to these issues, but the system can still be improved. To further optimize and enhance its performance, we propose the implementation of hardware acceleration.
In semester HK192, the previous group devised a solution [9] for the Smart Parking problem. The solution is clever in that it avoids IoT devices such as sensors, which are costly and difficult to deploy. For an IoT solution to work, there needs to be a sensor in each parking space and a very high-performance server to process the signals and data of every sensor. The maintenance cost is also high, as the sensors are prone to damage from environmental factors.
The solution devised is to use security cameras, which are already present in almost all parking lots, along with an edge device hub that processes the images from the cameras in order to detect vacant parking spaces. The data is processed directly at the edge devices before being sent to the server, which reduces the server load. In addition, the group implemented a web server and a database system alongside a phone application for real-world usage.
However, the image processing system on the edge devices is still at a primitive stage, with many issues in providing data for the Smart Parking app. Therefore, we are tasked with improving the system, especially the AI system on the edge devices, for better image detection and processing.
1.2 Scope and Objectives
1.2.1 Problem Statements
The first problem of the thesis is to improve the current state of Smart Parking: detecting cars at the parking spaces. The system already in place uses a deep learning neural network (NN) to detect the car class, but it still has many issues, such as low accuracy in the number of cars detected at the parking slots. The system is also incapable of determining how many of the detected parking spaces are vacant. In its current state, the parking spaces are designated manually and the system can tell whether a space is vacant or not. This is an acceptable solution, but for a Smart Parking system, parking spaces would need to be detected automatically without human intervention.
The second problem that the thesis tries to solve is the implementation of the said detection model for the car and parking space classes on an edge device. This implementation also has to provide a real-time detection mechanism on the edge without the assistance or processing power of a server workstation. This puts many constraints on the NN model, as it has to comply with the limitations of the current hardware. These constraints consist of processing speed, memory and power consumption. The NN model has to trade accuracy against latency, memory consumption and power consumption within the hardware limitations.
1.2.2 Objectives
Considering the trade-offs mentioned, a suitable neural network model has to be carefully researched and implemented based on the targeted hardware. After the considerations during pre-thesis, I have decided that a BNN would be most applicable for hardware usage. A BNN is a type of neural network that stores the values of weights and activation tensors as -1 and 1. While it does have drawbacks, a BNN also has many advantages over other neural network models, the greatest being that it is considerably lighter and can run on edge devices such as a Multiprocessor System on Chip (MPSoC). For this project, we aim to implement our own BNN model for the purpose of Smart Parking. Implementation consists of building, training and evaluating the neural network. Another goal of the thesis is to implement the above BNN model on a suitable edge device capable of AI processing. The hardware that I have chosen for this purpose is the Arm-based Xilinx Zynq UltraScale+™ MPSoC development board Ultra96v2. Another important task is to test the implemented system in a real environment, so as to evaluate the solution and its usability in real-life conditions based on suitable metrics.
1.3 Structure of Thesis
The remainder of the thesis is organized as follows. Chapter 2 first reviews the overall background, including the history of the development of AI, Convolutional Neural Networks and Binary Neural Networks. It then covers the hardware: the definitions of FPGA, MPSoC and GPU, and information about the Ultra96-V2 board and the Jetson Nano. Chapter 3 goes through a number of related works on BNN, FracBNN and hardware acceleration solutions from the past. Chapter 4 then defines the proposed solution, with the architecture of the system and the approach to hardware acceleration. Chapter 5 describes the actual implementation of the FracBNN+CNRPark model and its embedding on the Ultra96-V2, and investigates the results of the various experiments to compare the effectiveness of the FracBNN+CNRPark model, as well as to compare the FPGA implementation on the Ultra96-V2 with the GPU implementation on the Jetson Nano. Finally, Chapter 6 evaluates my accomplishments and difficulties so far and proposes room for improvement.
2.1.1 The development of AI
"AI" refers to all machines and technologies that, on a high level, can be considered to learn new traits, adapt to the environment, understand complex concepts, or solve challenging problems [1].
Machine learning (ML) is the area of AI in which a machine actually learns about the world around it and then applies that knowledge. In ML, a system is given access to enormous amounts of data and tasked with learning from or improving upon the data without explicit programming. There are some distinctions among the various ML techniques. In supervised learning, a training dataset with known labels is provided to the algorithm. The system improves itself until it predicts the proper label for each data class. It can then predict a classification for new data and present a suitable label when given that data. Unsupervised learning, on the other hand, trains the system using unlabeled data. Although the system is unable to determine what the data is supposed to be, and therefore cannot produce the proper output for any new data, it does identify recurring patterns and structures in the provided data that can be further examined [1].
A third approach is known as reinforcement learning, in which the system is trained through trial-and-error searching and is rewarded when the desired output is obtained. In order to maximize its performance, the system is forced to search for the ideal behavior in each situation. One way to implement ML is through neural networks (NN), which are part of what is referred to as deep learning (DL). Deep learning refers to the use of multiple layers in a neural network in order to train on and classify data. It is one of many low-level ML and AI implementations [1].
Figure 2.1: Development timeline of AI, ML and DL [1]
2.1.2 Neural Network (NN)
Artificial neural networks, also known as neural networks, are modeled after the neuronal network in the brain, where data is processed by sending signals from one neuron to another. The neurons in an NN are organized into layers, and each layer processes the data differently. The data to be processed is always received by an input layer, and the output layer always provides the output prediction. There may be additional hidden layers between them that further process the data. The depth of the network, as the term is used in deep learning, is the number of layers in an NN [1].
Figure 2.2: A neural network with 4 layers
Each neuron in an NN functions as a sum function, taking inputs from neurons in the previous layer and adding them all up. In addition to the inputs, different parameters are added to the network to help it find features. Each input to a neuron has a weight W, so some inputs are more important than others. A bias B can be added after the weighted inputs have been summed. An activation function then acts on the total sum to determine how the value is passed on. The input to neurons in the following layer is determined by the output of the activation function and the weight W. For a given NN architecture, where the type and number of layers are fixed, the parameters W and B are what determine the performance. The formula below describes the connection between the inputs, weights, bias, and output of a neuron [1].
\[ \text{Output} = \varphi\left( \sum_{i=0}^{N} W_i \cdot \text{Input}_i + B \right) \]
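As a minimal sketch of the neuron described above, the following Python snippet computes the weighted sum of the inputs, adds the bias, and applies an activation function. The function names and the numeric values are hypothetical and chosen only for illustration; they do not come from the thesis implementation.

```python
import math


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(total)


def sigmoid(z):
    """Sigmoid activation, used here only as one possible choice of activation."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical example values (assumptions for illustration only).
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

# Weighted sum: 0.2 - 0.3 - 0.4 + 0.1 = -0.4, then sigmoid(-0.4).
out = neuron_output(inputs, weights, bias, sigmoid)
print(out)
```

Passing the activation as an argument keeps the sketch independent of any particular activation function.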
Figure 2.3: Composition of a hidden layer on the CNN
The activation function is used to decide whether a neuron should be active or not at the next layer. The most basic form of activation function is a binary step function, which produces 0 for inputs below a threshold and 1 for inputs above the threshold. The only information provided by this kind of activation function is whether the neuron is active or not. Non-linear functions such as the sigmoid and tanh functions are other examples of activation functions. These functions normalize the output while also providing a smooth gradient between their maximum and minimum values. The Rectified Linear Unit (ReLU) function and the leaky ReLU function are two additional varieties of activation functions. The advantages and disadvantages of each of these functions, which are mostly related to training, are discussed below [1].
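The five activation functions named above can be sketched in a few lines of Python. This is an illustrative sketch only: the threshold default, the treatment of an input exactly at the threshold, and the leaky ReLU slope of 0.01 are assumptions, not values taken from the thesis.

```python
import math


def binary_step(x, threshold=0.0):
    """0 below the threshold, 1 otherwise (the value at the threshold is a choice)."""
    return 1.0 if x >= threshold else 0.0


def sigmoid(x):
    """Smooth squashing of x into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Smooth squashing of x into the range (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Passes positive values through and outputs 0 for negative values."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Like ReLU, but keeps a small slope for negative values (slope is assumed)."""
    return x if x > 0 else slope * x


for value in (-2.0, 0.0, 2.0):
    print(value, binary_step(value), sigmoid(value), tanh(value),
          relu(value), leaky_relu(value))
```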
Figure 2.4: Five commonly used activation functions: (a) binary step function, (b) sigmoid function, (c) tanh function, (d) ReLU function, (e) leaky ReLU function. [1]
The training phase and the inference phase, also known as the classification phase, are the two distinct stages that a neural network goes through when processing a particular type of data. These stages frequently alternate during unsupervised and reinforcement learning: the network performs training, inference, parameter updates, tests with additional inference, and so forth. In supervised learning, training is always completed before any inference is performed. Because we expect the network to label similar data during inference, we feed it labeled data during the training phase [1].
A network’s parameters, its set of weights and biases, are iteratively updated during training in order to produce the desired results. Backpropagation and a loss function are frequently used to accomplish this. The output values from the network are compared to the desired values using a loss function. Categorical cross-entropy is a widely used loss function: [1]
\[ L = -\sum_{i} y_i \log(\hat{y}_i) \]
where $y_i$ is the desired value and $\hat{y}_i$ represents the value predicted by the network. The next step is to minimize the loss function, which provides the best possible set of parameters. Gradient descent is a popular technique for minimizing the loss function, in which the derivative of the loss function is first examined to determine how the parameters should be changed. The selection of the activation function is crucial at this point. The gradients of the sigmoid and tanh functions approach zero for very large or very small values, the gradient of the binary step function is zero, and the gradient of the ReLU function vanishes for negative numbers. When the gradient vanishes, the network cannot be trained by minimizing the loss function. To counter this, the leaky ReLU has a small gradient for negative numbers. Backpropagation is the entire procedure of taking the output of the loss function and updating the parameters back through the network [1].
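To make the loss-minimization step more concrete, the sketch below evaluates categorical cross-entropy in its usual form (the negative sum of the desired values times the logarithm of the predicted values) and applies a few gradient descent steps. As a simplifying assumption, the updates are applied directly to the three raw scores fed into a softmax rather than to the weights of a full network; all names and values are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss between one-hot desired values and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))


def gradient_step(scores, y_true, learning_rate):
    """One gradient descent step; for softmax plus cross-entropy the
    gradient with respect to the scores is (prediction - desired value)."""
    y_pred = softmax(scores)
    grads = [p - t for p, t in zip(y_pred, y_true)]
    return [s - learning_rate * g for s, g in zip(scores, grads)]


y_true = [0.0, 1.0, 0.0]   # hypothetical one-hot label
scores = [0.0, 0.0, 0.0]   # hypothetical starting scores
loss_before = categorical_cross_entropy(y_true, softmax(scores))
for _ in range(10):
    scores = gradient_step(scores, y_true, learning_rate=0.5)
loss_after = categorical_cross_entropy(y_true, softmax(scores))
print(loss_before, loss_after)
```

In a real network the same gradient is propagated backwards through every layer to update W and B, which is what backpropagation refers to.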
The training phase has additional parameters that govern how the parameters W and B are updated. These parameters, which are referred to as hyperparameters, are set when the training process begins. The learning rate, batch size, and number of epochs are the most frequently mentioned hyperparameters. After gradient descent has determined the direction in which to change a parameter, the learning rate sets the size of the change. If the learning rate is too low, each iteration updates the parameters in very small steps and training takes longer. With an excessive learning rate, the parameter values may diverge. Batching is frequently used to shorten training times: multiple outputs are processed at once rather than the parameters being updated after each output. The batch size thus determines how many outputs are batched together. One epoch is equal to one pass of the training dataset through the network, and the number of epochs simply controls how many times this is done [1].
The performance of the trained network is determined by the hyperparameters. However, there are times when the network is trained too well on a dataset, leading to longer inference times and lower accuracy. It is overly focused on that dataset, which means it might no longer be able to recognize general features. This is similar to a person learning the word "car" while only ever seeing a Ferrari sports car. Even though a Ford has four wheels and a steering wheel, the person may not immediately recognize it as a car if they later come across one. To that person the two do not look alike, so the Ford is not a car. As previously mentioned, the hyperparameters can be altered to keep the parameters from being overfitted. Dropout is a different technique that involves removing neurons from the network at random, effectively deactivating them. This constant modification of the network helps ensure that it is not overfitted [1].
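The influence of the learning rate discussed in this section can be illustrated with a toy problem. The sketch below runs gradient descent on the function f(w) = w², which is an assumption chosen purely for illustration because its gradient 2w is easy to follow; the three learning rates are likewise hypothetical.

```python
def minimise(learning_rate, steps=20, w=1.0):
    """Gradient descent on f(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w


slow = minimise(0.001)     # too low: after 20 steps w has barely moved
good = minimise(0.1)       # moderate: w approaches the minimum at 0
diverged = minimise(1.1)   # too high: every step overshoots further

print(slow, good, diverged)
```

With a step count fixed at 20, the low rate leaves w close to its start value, the moderate rate brings it near zero, and the excessive rate makes its magnitude grow.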
It has been demonstrated that the neural network’s performance is significantly impacted by the parameter datatype. 32-bit floating point (FP32) has traditionally dominated deep learning, but quantization techniques are now used to bring the precision down to FP16, INT8, or even binary. Quantization refers to reducing the number of bits used for the parameters. The number range that can be represented in FP32 is approximately ±3.4 × 10^38. Quantizing FP32 parameters into, say, INT8 entails mapping all FP32 parameter values to the INT8 range, which is [-128, 127]. Even more aggressive quantization is also applied, down to ternary (-1, 0, 1) or binary (-1, 1) parameters. Precision decreases when parameters are quantized to a smaller datatype, but bandwidth and memory usage are greatly reduced as a result. For the ternary and binary cases, convolutions and fully connected layers can be calculated using addition and subtraction, which requires less computing power [1].
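The mapping from floating-point parameters to INT8 and to binary values can be sketched as follows. Symmetric linear scaling is assumed here as one common quantization scheme; it is not necessarily the scheme used by the cited work, and the function names and weight values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats into the INT8 range [-128, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


def binarize(values):
    """Most aggressive case: keep only the sign, +1 if v > 0 and -1 otherwise."""
    return [1 if v > 0 else -1 for v in values]


weights = [0.4, -1.0, 0.25, 0.0]   # hypothetical FP32 parameters
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
binary = binarize(weights)

print(q)         # integers stored in 8 bits each
print(restored)  # close to, but not exactly, the original weights
print(binary)    # one bit of information per weight
```

The difference between `weights` and `restored` is the precision lost to quantization, which is the price paid for the lower memory and bandwidth usage.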
One of the many uses for neural networks is image processing, which isthe main objective in this thesis Image manipulation, augmentation, and dataprocessing are all included in the field of image processing Classifying what animage represents or locating objects within an image and subsequently classifyingthose objects are typical uses A neural network is used for image classification
to identify features in an image that together help determine what the imagerepresents Finding pertinent objects is the first step in object detection Typically,this is accomplished by highlighting the region where pertinent objects can befound, after which the region is examined as an image classification problem [1]
2.1.3 Convolutional Neural Network (CNN)
A convolutional neural network (CNN) is a type of neural network that searches through input data for features using convolutional operations. Since a 2D convolution can be represented with matrix multiplication, as explained below, CNNs are popular networks when the data is in the form of an image. To classify an image into one of a number of predetermined outputs, they first identify minor features in the image and then combine these features to create a more comprehensive understanding of what the image represents [1].
Figure 2.5: Structure of a CNN. The data is run through several convolutional and pooling layers, learning features in the image. [1]
A width-by-height image can be thought of as a 2D matrix, with each element denoting the value of the corresponding pixel. These numbers, which typically fall between 0 and 255, indicate the pixel’s intensity. A single matrix is used to represent a black-and-white image, with 0 denoting black, 255 denoting white, and shades of grey in between. Three matrices (one for red, one for green, and one for blue) are stacked on top of each other to create colored RGB images. This adds depth, the image’s third dimension. The input to the CNN then becomes a 32x32x3 matrix for a 32x32 RGB image. The concept of CNNs comes from the convolution formula, in which one function is convolved with another: [1]
\[ f(x) * g(x) = \int_{-\infty}^{\infty} f(\tau)\, g(x - \tau)\, d\tau \]
In the case of a CNN, the two functions are the input matrix and the convolution filter. The convolution filter is also a 2D matrix, designed to find a specific feature. The filter outputs a feature map after sweeping across the input matrix. In practice, a single convolutional layer scans the input data with a number of different feature-finding filters, resulting in a 3D matrix of feature maps regardless of the depth of the input data.
\[ (I * K)_{xy} = \sum_{i=1}^{h} \sum_{j=1}^{w} K_{ij} \cdot I_{x+i-1,\; y+j-1} \]
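The discrete 2D convolution used in a CNN can be sketched in plain Python: at every position, the filter is laid over a patch of the image, the overlapping elements are multiplied, and the products are summed. The sketch assumes no padding and a stride of 1, uses zero-based indices, and the image and filter values are hypothetical. A 3x3 filter sweeping a 4x4 image yields a 2x2 feature map.

```python
def conv2d(image, kernel):
    """2D convolution as used in CNNs, with no padding and a stride of 1."""
    h, w = len(kernel), len(kernel[0])
    out_rows = len(image) - h + 1
    out_cols = len(image[0]) - w + 1
    output = []
    for x in range(out_rows):
        row = []
        for y in range(out_cols):
            acc = 0
            # Sum of element-wise products of the filter and the image patch.
            for i in range(h):
                for j in range(w):
                    acc += kernel[i][j] * image[x + i][y + j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 4x4 single-channel image.
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
# Hypothetical 3x3 filter that responds to vertical edges.
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[-2, -2], [2, -2]]
```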
This is an example of the application of a convolutional filter to input data and the resulting output:
Figure 2.6: Convolutional filter size 3x3 sweeping through image size 4x4
Convolution uses several filters to increase the depth of the image while decreasing its width and height. The reduction in width and height can be avoided by using padding. With padding, zero-valued elements are added to the borders of the input matrix, making it possible to obtain a convolution output that is equal in size to, or larger than, the image. Changing the filter’s stride length as it sweeps across the input also alters the size of the output [10].
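How padding and stride change the output size can be checked with the standard output-size formula for convolutions, floor((n + 2p - k) / s) + 1, where n is the input size, k the filter size, p the padding and s the stride. This formula is common knowledge rather than something stated in the text above, and the function name is hypothetical.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output width (or height) of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(4, 3))                       # 2: output shrinks
print(conv_output_size(4, 3, padding=1))            # 4: padding keeps the size
print(conv_output_size(4, 3, padding=1, stride=2))  # 2: stride reduces it again
print(conv_output_size(32, 3, padding=1))           # 32: a 32x32 input is preserved
```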
Figure 2.7: Output (darkgreen) from a convolutional filter
The pooling operation, also known as the pooling layer, is another type of operation. In the CNN structure illustrated in Figure 2.5, a three-channel image is the network’s input. The data is put through a number of convolutional and pooling layers that learn the image’s features. Fully connected layers then classify the image and determine the appropriate label using the learned features. In the pooling layer, several pixels are grouped and reduced to a single pixel, which decreases the size of the image and condenses it. Either the maximum value from the group is kept, which is known as max pooling, or the average is used, which is known as average pooling.
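Max pooling and average pooling can be sketched with one helper that groups pixels into non-overlapping windows and reduces each window to a single value. The 2x2 window size, the function names and the feature map values are assumptions made for illustration.

```python
def pool2d(matrix, size, reduce_fn):
    """Reduce each non-overlapping size x size window to one value."""
    pooled = []
    for r in range(0, len(matrix) - size + 1, size):
        row = []
        for c in range(0, len(matrix[0]) - size + 1, size):
            window = [matrix[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def average(values):
    """Arithmetic mean of the values in one window."""
    return sum(values) / len(values)


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]

max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, average)
print(max_pooled)  # [[7, 8], [9, 6]]
print(avg_pooled)  # [[4.0, 5.0], [4.5, 3.0]]
```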
Figure 2.8: Convolutional filter sweeping with padding size 1, stride of 2
After the convolution and pooling layers, the data is run through a flattening layer, which turns the 3D matrix into a 1D array for the following layers. Fully connected layers are one-dimensional (1D) arrays in which each element is represented by a neuron. Every neuron in a fully connected layer is linked to every neuron in the previous layer. The final step in a CNN is applying the softmax function, which normalizes the values of the 1D array into a probability distribution. This provides a prediction that can be expressed as a probability for each class.
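The flattening step and the final softmax can be sketched as below. The nested-list layout [channel][row][column] and the three example scores are assumptions for illustration; a real network would place fully connected layers between the two steps.

```python
import math


def flatten(volume):
    """Turn a 3D list indexed as [channel][row][column] into a 1D list."""
    return [value for channel in volume for row in channel for value in row]


def softmax(scores):
    """Normalize scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-channel, 2x2 output of the last pooling layer.
volume = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
flat = flatten(volume)

# Hypothetical raw scores from the last fully connected layer, one per class.
probs = softmax([2.0, 1.0, 0.1])

print(flat)
print(probs)
```

The largest score always receives the largest probability, so the predicted class is the index of the maximum entry in `probs`.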
Figure 2.9: Output (darkgreen) of max pooling (left) and average pooling (right)
A number of CNN architectures are frequently used or mentioned because of their influence on the field or their effectiveness. These include LeNet-5, which contains three convolutional layers followed by two fully connected layers and still serves as a template for stacking convolutional and fully connected layers, and AlexNet, which contains five convolutional layers and three fully connected layers and popularized the ReLU activation. Further examples are VGG-16, which goes deeper with 16 layers in total, and Inception-V1/V3/V4, which stacks blocks of multiple layers to create a network-in-a-network structure, true to its name. Another is ResNet, one of the first architectures to employ batch normalization, whose variants range from ResNet-50 up to 152 layers. Finally, MobileNets are a collection of networks intended for TensorFlow-based mobile and embedded applications [10].
2.1.4 Binary Neural Network (BNN)
One of the newer, prominent types of NN is the BNN. It is also called the binarized convolutional neural network because it functions in nearly the same way as a CNN, including the use of convolutional layers. The parameters in binary neural networks (BNNs) are quantized to one-bit precision, so each takes one of only two values. Because of issues during training, early BNNs were not strictly binary. Gradient descent on a binary function cannot update the weights in small steps, so backpropagation could not be used directly. These early BNNs either used different training techniques or started out with real-valued parameters that were then quantized. Later work made it possible to use gradient descent and backpropagation together with binary values for weights and activations. BNNs benefit from the reduction of the dot product between weights and activations to bitwise operations: multiplying binary values is equivalent to the XNOR logical operation, which can be easily implemented in hardware. Although BNNs typically have lower accuracy than their higher-precision counterparts, they are significantly less resource and memory intensive, and their inference times are much shorter [1].
$$
x^{b} = \operatorname{Sign}(x) =
\begin{cases}
+1 & \text{if } x > 0 \\
-1 & \text{otherwise}
\end{cases}
$$
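The equivalence between multiplying binary values and the XNOR operation can be checked with a small sketch. It assumes that non-positive inputs binarize to -1 and that +1 is stored as bit 1 and -1 as bit 0; the function names and sample values are hypothetical and only illustrate the principle.

```python
def sign(x):
    """Binarize a real value: +1 if x > 0, otherwise -1 (assumed convention)."""
    return 1 if x > 0 else -1


def encode_bits(values):
    """Pack a list of +1/-1 values into an integer (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, w_bits, n):
    """Dot product of two n-element +1/-1 vectors using XNOR and popcount."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ w_bits) & mask).count("1")  # positions where the bits agree
    return 2 * matches - n                               # agreements add +1, mismatches add -1


activations = [0.7, 1.2, 0.3, -0.4]
weights = [0.5, 0.9, -0.8, -0.1]
a = [sign(x) for x in activations]   # [1, 1, 1, -1]
w = [sign(x) for x in weights]       # [1, 1, -1, -1]

reference = sum(x * y for x, y in zip(a, w))                       # ordinary dot product: 2
fast = xnor_popcount_dot(encode_bits(a), encode_bits(w), len(a))   # bitwise version: 2
```

In hardware, the XNOR and the bit count map onto simple logic, which is why the multiply-accumulate units of a conventional network are not needed for this step.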
Compared to a conventional neural network, a binary neural network is better suited to edge devices such as MPSoCs. However, it does have drawbacks, the most notable of which is that the gradient of the binarization function is zero almost everywhere, which complicates training. BNN models are also observed to be less accurate than full-precision networks. Still, because of their light weight and suitability for edge devices, BNNs are constantly being improved, and there are methods that help improve accuracy. The most frequently cited BNN is the one in the paper by Courbariaux et al., who were the first to present a fully binarized network. XNOR-Net enhanced the BNN by including a gain term to make up for the information lost during binarization. Many other highly cited networks, including DoReFa-Net, ABC-Net, and BNN+, focused on improving binary neural network training. eBNN demonstrates the use of BNNs in embedded systems and the resulting faster inference times [1].
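The idea behind a gain term can be sketched numerically. The example below assumes the gain is the mean absolute value of the real-valued weights, as commonly described for XNOR-Net; it is an illustration with made-up weights, not the training procedure used in this thesis.

```python
def sign(x):
    """Binarize a real value: +1 if x > 0, otherwise -1 (assumed convention)."""
    return 1 if x > 0 else -1


def binarize_with_gain(weights):
    """Return (alpha, signs) so that alpha * signs approximates the weights."""
    alpha = sum(abs(w) for w in weights) / len(weights)  # assumed gain: mean |w|
    signs = [sign(w) for w in weights]
    return alpha, signs


def squared_error(weights, approx):
    """Sum of squared differences between the weights and an approximation."""
    return sum((w - a) ** 2 for w, a in zip(weights, approx))


weights = [0.2, -0.4, 0.1, -0.3]                  # hypothetical real-valued weights
alpha, signs = binarize_with_gain(weights)        # alpha is about 0.25
plain_error = squared_error(weights, signs)       # binarization without gain
scaled_error = squared_error(weights, [alpha * s for s in signs])  # with gain
```

For these sample weights the scaled approximation is much closer to the original values than the plain signs, which is the information the gain term is meant to recover.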
2.2.1 Smart Parking
The proposed Smart Parking solution uses a heterogeneous computing platform to detect parking lot occupancy. It allows surveillance camera systems to be integrated to collect and process data for the Smart Parking system (see the Smart Parking Solution figure).
Using Edge AI, the system processes the input from surveillance cameras directly on the edge node before transferring data to the server. For the purpose of this project, an AI detection model will be implemented on the edge node to detect occupancy based on license plate detection [4].
Figure 2.10: Smart Parking Solution
2.2.2 Edge AI
Edge AI refers to the deployment of AI software on hardware throughout the physical world. It is named "edge AI" because the AI computation is done close to the user at the edge of the network, near where the data is located, rather than centrally in a cloud computing facility or private data center. Because the internet is accessible almost everywhere, the edge of the network can refer to any location. It might be a department store, factory, hospital, or one of the devices we see every day, such as traffic lights, robots, and phones [11].
With the help of developments in edge AI, robots and devices can now operate with something approaching the "intelligence" of human cognition wherever they may be. Smart applications with AI capabilities can learn to carry out similar tasks under varying conditions, much as people do. The development of neural networks and IoT devices brings a variety of advantages, including real-time insights, lower costs, increased privacy, high availability, and continuous improvement [11].
Figure 2.11: Edge AI Workflow
In edge AI deployments, the inference engine runs on some sort of computer or device in remote locations such as factories, hospitals, automobiles, satellites, and residences. When the AI runs into a difficulty, the problematic data is frequently sent to the cloud so that the original AI model can be trained further; the retrained model eventually replaces the inference engine at the edge. Thanks to this feedback loop, deployed edge AI models keep getting more intelligent, which significantly improves model performance [11].
Computing operations can be sped up by recognizing and exploiting the particular strengths of different hardware. This is also true for machine learning applications, where achieving high-performance inference requires suitable hardware. Neural network applications take advantage of the fact that binary arithmetic and matrix multiplication can be significantly accelerated on parallel hardware [1].
2.3.1 FPGA and SoC
Field Programmable Gate Arrays (FPGAs) are semiconductor devices that can be programmed and reprogrammed to the desired functionality or application. Even though they are typically less efficient than an ASIC for any given task, they have the advantage of being reprogrammable as a design changes. They were previously used mainly for ASIC prototyping or for lower-volume designs and products, but they are now preferred in a wide range of applications. FPGAs are currently used in a variety of sectors and industries, including consumer goods, aerospace, automotive, medicine, and data centers [1].
FPGAs are made up of configurable logic blocks (CLBs) arranged in a matrix and connected by programmable interconnects. The CLBs implement the logic functions, and the interconnects determine how the CLBs are wired together. The number of logic blocks that fit inside the physical area of an FPGA determines how large a design can be implemented on it. In addition to CLBs, I/O blocks are required to connect the FPGA to the outside world. FPGAs are considered parallel by nature because different parts of the FPGA can be programmed to carry out operations in the same clock cycle [1].
Figure 2.12: An FPGA block diagram
Lookup tables (LUTs) and flip-flops (FFs) are the two fundamental parts that make up the CLBs of an FPGA. The combinational logic is handled by LUTs, which are truth tables. Rather than providing a fixed number of ready-made logic gates, the FPGA allows each LUT to be configured to function as any logic gate [39]. Flip-flops are binary registers that hold either a 1 or a 0 until the next clock edge arrives, preserving state between clock cycles. Without FFs there would be no way to keep track of statuses, state machines, or counters. Two other parts of an FPGA should be mentioned in addition to the LUTs and FFs. The first is block RAM (BRAM), which is memory housed inside the FPGA. Although memory can also be located outside the FPGA, as with EPROM, SRAM, or SD cards, BRAM is used for data that needs to be accessed without leaving the FPGA through the I/O blocks. The second is DSP slices: prebuilt multiply-accumulate circuits that are typically used when implementing such common operations in general-purpose logic would be too resource-intensive and complex [1].
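How a LUT acts as a configurable truth table, and how a flip-flop holds state between clock edges, can be modeled in software. The classes below are a conceptual sketch with hypothetical names; they do not describe the actual CLB structure of any particular FPGA.

```python
class LUT:
    """k-input lookup table: the output is read from a stored truth table."""

    def __init__(self, num_inputs, truth_table):
        if len(truth_table) != 2 ** num_inputs:
            raise ValueError("truth table must have 2**num_inputs entries")
        self.num_inputs = num_inputs
        self.truth_table = list(truth_table)

    def __call__(self, *inputs):
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        address = 0
        for bit in inputs:                    # the input bits form the table address
            address = (address << 1) | (bit & 1)
        return self.truth_table[address]


class FlipFlop:
    """1-bit register that updates its stored value only on a clock edge."""

    def __init__(self):
        self.q = 0

    def clock(self, d):
        self.q = d & 1
        return self.q


# The same LUT structure becomes a different gate by changing its contents.
and_gate = LUT(2, [0, 0, 0, 1])
xor_gate = LUT(2, [0, 1, 1, 0])
xnor_gate = LUT(2, [1, 0, 0, 1])

# A LUT feeding a flip-flop forms a 1-bit toggle counter.
ff = FlipFlop()
states = [ff.clock(xor_gate(ff.q, 1)) for _ in range(4)]  # [1, 0, 1, 0]
```

The toggle counter shows why both parts are needed: the LUT computes the next value, and the flip-flop remembers it until the following clock edge.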
FPGAs are programmed using hardware description languages (HDLs), most frequently VHDL or Verilog. An HDL description of a design's behavior is used to configure the FPGA with an implementation of that design. For designs that require a processor, a soft-core CPU can be implemented in a portion of the FPGA, with the remaining fabric used for other functions. More recently, FPGAs and CPUs have been combined into heterogeneous designs known as Systems on Chip (SoCs). These SoCs are made up of two separate components: the Processing System (PS), which houses the CPU, and the Programmable Logic (PL), which houses the FPGA [1].
Figure 2.13: Front view of Ultra96-V2
The hardware chosen for this thesis is an Ultra96-V2 board distributed by Avnet. It is built around a ZU3EG multiprocessor system on a chip (MPSoC). This chip has an UltraScale architecture and is part of the Zynq UltraScale+ family. The Ultra96-V2 is equipped with a dual-core ARM Cortex-R5 and a quad-core ARM Cortex-A53 processor, which together can run a full operating system. Compared to the fastest CPUs, the MPSoC's FPGA enables hardware acceleration of up to a factor of 20. As a result, Avnet suggests the board as being well suited to high-speed AI. The Ultra96-V2 also provides two USB 3.0 ports and 2 GB of low-power double data rate 4 (LPDDR4) RAM, both of which are necessary for fast image processing. A monitor can be connected through a Mini DisplayPort (mDP), which allows the board to operate independently. The PS ZU3EG consists of
their CUDA parallel platform for GPUs, NVIDIA has advanced the use of GPUs in other fields [1]
The GPU system tested in this thesis is the Jetson Nano by NVIDIA. It is a single-board computer in the style of the Raspberry Pi with a 128-core Maxwell GPU. It has a 1.43 GHz ARM Cortex-A57 CPU, 4 GB of RAM, and a microSD card for storage. The board includes extras such as Ethernet and pin connectors, HDMI and DisplayPort ports, and USB ports. The board can be powered through either a 5V/4A barrel connector or a 5V/2.5A micro USB port; the 5V/4A barrel connector was used for this work. Because it runs the full Ubuntu operating system, the Jetson Nano can function as a standalone computer. As a result, it can benefit from the frameworks, tools, and libraries that Ubuntu offers to boost development efficiency. To get the best performance out of the Jetson Nano, the power mode must be set to maximum.
2.4.1 PyTorch
PyTorch was made available in 2016, and a growing number of researchers are adopting it. PyTorch is maintained by Facebook, which also maintains Caffe2 (Convolutional Architecture for Fast Feature Embedding). Converting a PyTorch-defined model into a Caffe2 model used to be difficult. With this in mind, Facebook and Microsoft created the Open Neural Network Exchange (ONNX) in September 2017. Simply put, ONNX was created for model conversion between frameworks. In March 2018, Caffe2 and PyTorch were merged.
PyTorch makes it easy to construct even very complex neural networks, and it has quickly gained popularity as a result. In research projects it gives TensorFlow stiff competition. The creators of PyTorch set out to build a highly imperative library that could quickly handle all numerical computation. Running and testing a portion of code in real time used to be a significant problem for deep learning researchers, machine learning developers, and anyone debugging neural networks. PyTorch addresses this by letting them run and test their code in real time, so they do not have to wait to see whether it works.
Figure 2.15: A PyTorch workflow
2.4.1.0.1 Advantages of PyTorch
Finishers in Kaggle competitions now use PyTorch consistently. Like Python, it provides a straightforward interface and an API that is simple to use, and the framework is easy to run on both Windows and Linux. PyTorch offers a hybrid front end that provides flexibility and ease of use in eager mode while being able to switch to graph mode for speed, optimization, and functionality in C++ runtime environments. Distributed training of neural network models is possible with PyTorch. With native support for asynchronous execution of collective operations and peer-to-peer communication, accessible from both Python and C++, it aims for optimized performance in both research and production.
Python serves as the foundation of PyTorch. PyTorch can be used with the most well-known Python libraries and packages, including Cython and Numba, and it integrates closely with Python. Its API is Pythonic, meaning that code written with it follows common Python idioms rather than the conventions of languages such as Java and C++. A robust ecosystem of tools and packages is available for extending PyTorch and enabling development in fields such as computer vision and reinforcement learning. This ecosystem was created by a vibrant community of researchers and developers, and it aids in the development of deep learning neural networks that are flexible and fast.
2.4.2 Vivado and Vivado HLS
Xilinx provides a wide range of tools for use with its hardware. A user can write their own IP or use pre-existing IP in order to create a design, run simulations, produce an RTL design, infer constraints, and ultimately produce a bitstream that can be loaded onto the intended hardware. The hardware is described in an HDL such as VHDL or Verilog. A synthesis tool takes these register-transfer level (RTL) descriptions and maps them onto the FPGA. The HLx editions of the Vivado Design Suite offer this functionality. Vivado's output is a bitstream that can be loaded onto the target [1].
Vivado HLS gives software developers the option to create accelerated applications using C or C++. Application programming interfaces (APIs) are used to build RTL IP and communicate with the hardware. The application is compiled with the g++ compiler and the v++ compiler, both of which are included in the Vitis core development kit, so that it can run on an x86 host. The kit also comes with an ARM compiler for cross-compiling the application to run on the embedded processor of a Xilinx device [1].
2.4.3 PYNQ and BNN-PYNQ
Xilinx's PYNQ project makes it possible to use Python on a Zynq SoC. PYNQ is distributed as a bootable Linux image that is written to the SD card of the board housing the SoC. The image uses an Ubuntu rootfs as its file system and includes the required pynq Python packages. Through Jupyter Notebook, the board can be accessed with a web browser from a PC on the same LAN. For the hardware description on an SoC, PYNQ uses hardware libraries referred to as overlays. The PL of the SoC cannot be used without an overlay that is appropriate for the project. Although many online projects come with overlays, custom HDL designs require an overlay created specifically for the project [1].
The BNN-PYNQ GitHub repository maintained by Xilinx provides Jupyter notebooks and overlays for PYNQ boards as well as the Ultra96-V2. The repository is based on the FINN paper [12], which describes the networks used. The notebooks contain numerous examples of how to use the provided overlays for performance tests. The source code and pip installation packages are available, and the repository also includes a script for rebuilding the hardware files. With this, a user can open the Vivado and Vivado HLS projects of the overlays and either modify them or use them as a starting point for a custom overlay [1].