Design an Efficient FPGA-Based Accelerator for Real-Time Parking Occupancy Detection



HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY OF COMPUTER SCIENCE AND ENGINEERING

GRADUATION THESIS

DESIGN AN EFFICIENT FPGA-BASED ACCELERATOR FOR REAL-TIME PARKING OCCUPANCY DETECTION

Major: COMPUTER ENGINEERING

THESIS COMMITTEE: COMPUTER ENGINEERING

SUPERVISOR(S): ASSOC. PROF. DR. TRAN NGOC THINH

MR. HUYNH PHUC NGHI

MEMBER SECRETARY: ASSOC. PROF. DR. PHAM QUOC CUONG

STUDENT 1: NGUYEN VU THANH NGUYEN - 1652437

HO CHI MINH CITY, 01/2023


UNIVERSITY OF TECHNOLOGY

FACULTY: Computer Science & Engineering

GRADUATION THESIS ASSIGNMENT

DEPARTMENT: Computer Engineering

Note: The student must attach this sheet to the first page of the thesis report

FULL NAME: Nguyễn Vũ Thành Nguyễn

STUDENT ID: 1652437

MAJOR: Computer Engineering

CLASS:

1. Thesis title:

Design an efficient FPGA-based accelerator for real-time parking occupancy detection

2. Tasks (requirements on content and initial data):

• Research a BNN approach to the image classification task with the CNRPark dataset

• Implement an encoder for input images and a set of weight parameters for parking solutions

• Build and evaluate a hardware accelerator running on the Ultra96v2 SoC with a pre-encoded 32x32 image set

3. Date of thesis assignment: 19/09/2022

4. Date of completion: 09/01/2023

1) Assoc. Prof. Dr. Trần Ngọc Thịnh

2) Eng. Huỳnh Phúc Nghị

The content and requirements of the graduation thesis have been approved by the Department.

Date: ... 2023

(Signature and full name) (Signature and full name)

FOR FACULTY AND DEPARTMENT USE:

Reviewer (preliminary grading):


I hereby declare that I worked on and am the sole author of this bachelor thesis, and that I have not used any sources other than those listed in the bibliography and identified as references. Other than that, the work presented is entirely my own.

Nguyen Vu Thanh Nguyen


This thesis has reached its result thanks to my own continuous efforts and to the support and encouragement of my lecturers, friends and family. I would like to express my sincere gratitude to those who have helped me throughout the study, the research and the work on this thesis.

I would first like to express my sincere gratitude to my supervisors, Assoc. Professor Tran Ngoc Thinh and BEng. Huynh Phuc Nghi, who have consistently helped me by providing not only the necessary tools to complete the thesis but also a wealth of information and direction so that I could take the best route. They never stopped inspiring me and gave me the chance to participate in truly fascinating work that was built on much of the knowledge I had acquired during my time at university. They have always been kind, understanding teachers who encouraged me to make changes to the work and to this thesis as needed. Working with them and gaining expertise under their guidance was an honor for me.

In addition to thanking the supervisors, I would also like to acknowledge the councilors at the thesis defense for their wise criticism and helpful suggestions.

Finally, I would like to thank my parents for always providing a positive environment for my development and for supporting me when I faced difficulties in both my academic and personal life.

Nguyen Vu Thanh Nguyen


This thesis proposes, studies and examines an approach to developing an edge-AI smart parking solution, including hardware and software components, by implementing the FracBNN+CNRPark model with hardware acceleration on the Ultra96-V2 board. This approach allows end users to monitor and detect busy and free parking spaces automatically via security cameras. The image classification model runs entirely on the edge, on the Ultra96-V2 board, without the help of a server workstation.


Commitment

1.1 Purpose and Motivation
1.2 Scope and Objectives
1.2.1 Problem Statements
1.2.2 Objectives
1.3 Structure of Thesis

2 Background knowledge and Terminology
2.1 Software - Artificial Intelligence
2.1.1 The development of AI
2.1.2 Neural Network (NN)
2.1.3 Convolutional Neural Network (CNN)
2.1.4 Binary Neural Network (BNN)
2.2 Smart Parking concepts
2.2.1 Smart Parking
2.2.2 Edge AI
2.3 Hardware and constraints
2.3.1 FPGA and SoC
2.3.2 GPU
2.4 Tools and Frameworks
2.4.1 PyTorch
2.4.2 Vivado and Vivado HLS
2.4.3 PYNQ and BNN-PYNQ
2.5 Performance Criteria
2.5.1 Recall, Precision & F1-score
2.5.2 Average IoU & mean Average Precision (mAP)
2.5.3 Power consumption
2.5.3.1 FPS & Latency

3 Literature Review
3.1 Smart Parking Related works
3.1.1 International Solutions
3.1.1.1 Moscow Parking (Source: https://parking.mos.ru/)
3.1.1.2 SENSIT
3.1.1.3 SFPark
3.1.1.4 Cisco Smart+Connected City Parking
3.1.2 Solutions in Vietnam
3.1.2.1 My Parking
3.1.2.2 IParking
3.1.3 Smart Parking systems with Image Processing
3.2 Previous Group's Thesis Result
3.3 Vien's Thesis
3.3.0.1 License Plate Dataset
3.3.0.2 YOLOv3 Object Detection Model
3.3.0.3 Implementing Vien's approach
3.3.0.4 Comparing YOLOv3 on Ultra96-V2 and JetsonNano
3.4 The Original BNN Model
3.5 Improved BNN Models
3.6 FracBNN (Dec 2020)

4 Methodology
4.1 The proposed solution - Previous thesis group
4.2 FracBNN+CNRPark
4.2.1 Model Architecture
4.2.2 Hardware Accelerator Architecture
4.3 Dataset
4.3.1 CNRPark+EXT

5 Implementation
5.1 FracBNN+CNRPark model on PyTorch
5.1.1 Training the model
5.1.1.1 Training dataset
5.1.1.2 Training routine
5.2 Hardware Acceleration on Ultra96-V2
5.2.1 Weights and Bias processing
5.2.2 Thermometer Encoding
5.2.3 Building model on Vivado HLS and Vivado
5.2.4 Inference on the Ultra96-V2
5.3 Evaluation
5.3.1 Training result
5.3.2 Hardware acceleration result

6 Conclusion
6.1 Summary
6.1.1 Comparing to Vien's thesis
6.1.2 Trained and implemented FracBNN+CNRPark model on PyTorch, run on GPU
6.1.3 Implemented hardware acceleration via Vivado HLS on Ultra96v2
6.1.4 Achieve inference result using built thermometer encoder on Ultra96v2
6.2 Future Improvements
6.2.1 Accuracy degradation
6.2.2 Real-time application


2.1 Development timeline of AI, ML and DL [1]
2.2 A neural network with 4 layers
2.3 Composition of a hidden layer on the CNN
2.4 Five commonly used activation functions: (a) binary step function, (b) sigmoid function, (c) tanh function, (d) ReLU function, (e) leaky ReLU function. [1]
2.5 Structure of a CNN. The data is run through several convolutional and pooling layers, learning features in the image. [1]
2.6 Convolutional filter size 3x3 sweeping through image size 4x4
2.7 Output (darkgreen) from a convolutional filter
2.8 Convolutional filter sweeping with padding size 1, stride of 2
2.9 Output (darkgreen) of max pooling (left) and average pooling (right)
2.10 Smart Parking Solution
2.11 Edge AI Workflow
2.12 An FPGA block diagram
2.13 Front view of Ultra96-V2
2.14 Block diagram of the Ultra96-V2
2.15 A PyTorch workflow
2.16 Calculation of IoU
2.17 mAP visualization
2.18 JetsonStats GUI
3.1 Vietnamese License Plate Dataset
3.2 YOLOv3 performance result
3.3 YOLOv3 box bounding technique
3.4 Darknet-53
3.5 GPU Implementation Workflow
3.6 Benchmark with threshold IoU 50%
3.7 Benchmark with threshold IoU 75%
3.8 A visualization of the sign layer and Straight-Through Estimator (STE). While the real values of the weights are processed by the sign function in the forward pass, the gradient of the binary weights is simply passed through to the real-valued weights. [2]
3.9 BNN Training Curve
3.10 Evolution of BNN Accuracy - Source: FracBNN introduction presentation slides (FPGA2021)
3.11 Main contributions of FracBNN - Source: FracBNN introduction presentation slides (FPGA2021)
3.12 Input images need to be first encoded
3.13 Results of binarizing the input layer using thermometer encoding on CIFAR-10 – ResNet-20 BNN has 0.27 million parameters and 40.9 million BMACs [3]
3.14 Improving BNN by computing an additional sparse binary convolution layer. [3]
4.1 Previous group's proposed solution - Smart Parking System Architecture [4]
4.2 FracBNN+CNRPark model architecture, based on Resnet20 [5]
4.3 Basic blocks - green highlights are the difference to ReActNet [6] model
4.4 Edge solution architecture
4.5 FracBNN accelerator architecture [3]
4.6 CNRPark+EXT dataset sample
5.1 FracBNN Training workflow
5.2 Generating an FPGA accelerator from trained FracBNN
5.3 Thermometer encoder workflow
5.4 Vivado HLS Utilization Estimate


2.1 Confusion Matrix [7]
3.1 Energy Consumption when Inferencing CNV Model on Ultra96v2 and VGG Model on Jetson Nano [4]
3.2 Summary of Vien's thesis result
3.3 Summary and Comparison between Ultra96v2 and Jetson Nano
3.4 A table of major details of the methods presented in this section
3.5 Comparison of accuracies on the ImageNet dataset from works presented in this section. Full precision network accuracies are included for comparison as well. [8]
5.1 Resources utilization on Ultra96v2
5.2 Accuracy comparison between models of input size 32x32
5.3 Inference results of prospective models on Ultra96v2
5.4 Power consumption on Ultra96v2 of different models

5.4 Power consumption on Ultra96v2 of different models 68


ASIC Application Specific Integrated Circuits

BNN Binary Neural Network

BRAM Block Random Access Memory

CNN Convolutional Neural Network

CPU Central Processing Unit

FF Flip-flop

FPGA Field Programmable Gate Array

GPU Graphics Processing Unit

MPSoC Multiprocessor System on Chip

LUT Look Up Table

PL Programmable Logic

RGB Red Green Blue


1.1 Purpose and Motivation

Machine learning became more widely known at the beginning of the twenty-first century. This resulted from the need to process increasing amounts of data and the availability of less expensive ML-capable hardware, like GPUs and RAM. Due to their inherent design, neural networks rely heavily on parallel computing to be effective, making a GPU with its many cores the ideal tool for the job. Because of this, GPUs have been the norm for machine learning applications for the last few years, but new hardware has recently been introduced. The Tensor Processing Unit, or TPU, a chip made specifically for machine learning, was unveiled by Google in 2016. As a way to obtain the effectiveness of a dedicated chip without the restrictions of an ASIC, FPGAs have also been showing promise over time. The development of machine learning has also greatly benefited from the current trend of cloud computing. More researchers can conduct their research cost-effectively thanks to the availability of the computing power of large GPU, TPU, or FPGA centers [1].

Each year, more and more applications for machine learning are released, signaling the field's rapid expansion. Today, we use such applications on a daily basis, whether they are in our cars, phones, medical software, or almost any other hi-tech product. To be accurate and effective, machine learning requires a lot of data, and as machine learning becomes more popular, more data is being sent across various networks. We send enormous amounts of data back and forth, particularly for applications that collect data in the field, send it to a datacenter for processing, and then return the results. As a result, the system that collects the data needs to further develop its ability to process data at the network's edge. For instance, for a video stream that counts the number of cars on a highway, this could significantly reduce the amount of data on the network: the counted number of cars can be sent to a datacenter whenever necessary, rather than sending full-scale images 30 times per second. As it stands today, machine learning will permeate more and more aspects of our lives. There are countless uses for intelligent machines that can assist us in performing tasks without being micromanaged. An application should always perform processing as close to the edge as possible to avoid clogging our communication networks with data [1].

As Vietnam continues to develop economically, our country has seen a steady increase in the number of cars in traffic, creating the need for more and bigger parking lots to accommodate them. This has led to many new problems for both drivers and management, one of which is the increasing difficulty of identifying and finding empty parking spaces. Smart Parking emerged as a viable solution to these issues, but such systems can still be improved. To further optimize and enhance their performance, we propose the implementation of hardware acceleration.

In semester HK192, the previous group already devised a solution [9] for the Smart Parking problem. The solution is clever in that it avoids IoT devices such as sensors, which are costly and difficult to deploy. For an IoT solution to work, there needs to be a sensor in each parking space, and a very high-performance server to process all of the signals and data from each sensor. The maintenance cost is also high, as the sensors are prone to damage due to environmental factors.

The solution devised is to use security cameras, which are already present in almost all parking lots, along with an edge device hub that processes the images from the cameras in order to detect vacant parking spaces. The data is processed directly on the edge devices before being sent to the server, which reduces the server load. In addition, the group also implemented a web server and a database system, alongside a phone application for real-world usage.

However, the image processing system on the edge devices is still at a primitive stage, with many issues in providing data for the Smart Parking app. Therefore, we are tasked with improving the system, especially the AI system on the edge devices, for better image detection and processing.


1.2 Scope and Objectives

1.2.1 Problem Statements

The first problem of the thesis is to improve the current state of Smart Parking: detecting cars in the parking spaces. The system already in place has implemented a deep learning neural network (NN) to detect the car classes, but it still has many issues, such as low accuracy in the number of cars detected in the parking slots. The system is also incapable of determining how many of the detected parking spaces are vacant. In the current state of the system, the parking spaces are designated manually, and the system can then tell whether each space is vacant or not. This is an acceptable solution, but for a Smart Parking system, parking spaces would need to be detected automatically without human intervention.

The second problem that the thesis is trying to solve is the implementation of the said detection model for the Car and Parking Space classes on an edge device. This implementation also has to provide a real-time detection mechanism on the edge without the assistance or processing power of the server workstation. This puts many constraints on the NN model, as it has to comply with the limitations of the current hardware. These constraints consist of processing speed, memory and power consumption. The NN model implemented has to trade accuracy against latency, memory consumption and power consumption within the limits of the hardware.

1.2.2 Objectives

Considering the trade-offs mentioned, a suitable neural network model has to be carefully researched and implemented based on the targeted hardware. After the considerations made during the pre-thesis, I have decided that a BNN would be the most applicable for hardware usage. A BNN is a type of neural network that stores the values of weights and activation tensors as -1 and 1. While it does have drawbacks, a BNN also has many advantages over other neural network models, the greatest one being that it is considerably lighter and can be run on edge devices such as a Multiprocessor System on Chip (MPSoC). For this project, we aim to implement our own BNN model for the purpose of Smart Parking. Implementation consists of building, training and evaluating the neural network. Another goal for the thesis is to implement the above BNN model on a suitable edge device capable of AI processing. The hardware that I have chosen for this purpose is the Arm-based Xilinx Zynq UltraScale+™ MPSoC development board Ultra96v2. Another important task is to test the implemented system in a real environment in order to evaluate the solution and its usability in real-life conditions based on suitable metrics.

1.3 Structure of Thesis

The remainder of the thesis is organized as follows. In Chapter 2, I first review the overall background, including the history of the development of AI, Convolutional Neural Networks and Binary Neural Networks. Secondly, I include information about the hardware: the definitions of FPGA, MPSoC and GPU, and information about the Ultra96-V2 board and the Jetson Nano. In Chapter 3, I go through a number of related works on BNN, FracBNN and hardware acceleration solutions from the past. In Chapter 4, a proposed solution is defined, with the architecture of the system and the approach to hardware acceleration. In Chapter 5, the actual implementation of the FracBNN+CNRPark model and its embedding on the Ultra96-V2 is described. In Section 5.3, the results of the various experiments are investigated to compare the effectiveness of the FracBNN+CNRPark model, together with a comparison between the FPGA implementation on the Ultra96-V2 and the GPU implementation on the Jetson Nano. Finally, my accomplishments and difficulties so far are evaluated, and room for improvement is proposed, in Chapter 6.

2 Background knowledge and Terminology

2.1 Software - Artificial Intelligence

2.1.1 The development of AI

"AI" refers to all machines and technologies that, on a high level, can be considered to learn new traits, adapt to the environment, understand complex concepts, or solve challenging problems [1].

The area of AI known as machine learning (ML) is where a machine actually learns about the world around it and then applies that knowledge. In ML, a system is given access to enormous amounts of data and tasked with learning from or improving upon the data without explicit programming. There are some distinctions among the various ML techniques. In supervised learning, a training dataset with known labels is provided to the algorithm. The system improves itself to map each data class to the proper label. It can then predict a classification for new data and present a suitable label when given that data. Unsupervised learning, on the other hand, trains the system using unlabeled data. Although the system is unable to determine what the data is supposed to be, and therefore cannot produce the proper output for any new data, it does identify recurring patterns and structures in the provided data that can be further examined [1].

A third approach is known as reinforcement learning, in which the system is trained through trial-and-error searching and is rewarded when the desired output is obtained. In order to maximize its performance, the system is forced to search for the ideal behavior in each situation. One way to implement ML is through neural networks (NN), which are part of what is referred to as deep learning (DL). Deep learning refers to the use of multiple layers in a neural network in order to train on and classify data. It is one of many low-level ML and AI implementations [1].

Figure 2.1: Development timeline of AI, ML and DL [1]

2.1.2 Neural Network (NN)

Artificial neural networks, also known as neural networks, are modeled after the neuronal network in the brain, where data is processed by sending signals from one neuron to another. The neurons in an NN are organized into layers, and each layer processes the data differently. The data to be processed is always received by an input layer, and the output layer always provides the output prediction. There may be additional hidden layers between them that further process the data. The depth of the network, as used in deep learning, is the number of layers in an NN [1].


Figure 2.2: A neural network with 4 layers

Each neuron in an NN functions as a sum function, taking inputs from neurons in the previous layer and adding them all up. In addition to the inputs, different parameters are added to the network to help it find features. Each input to a neuron has a weight W, so some inputs are more important than others. A bias B can be included after the weighted inputs have been added together. An activation function φ then acts on the total sum and determines the value that is passed on. The input to neurons in the following layer is determined by the output of the activation function and the weight W. For a given NN architecture, where the type and number of layers are set, the parameters W and B are what determine the performance. The formula below describes the connection between the inputs, weights, bias, and output of a neuron [1].

\[ \text{Output} = \varphi\left( \sum_{i=0}^{N} W_i X_i + B \right) \]
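As a minimal sketch of the neuron described above, the snippet below computes the weighted sum of the inputs, adds the bias, and applies an activation function. The function name `neuron_output` and the example values are hypothetical and chosen only for illustration; the sigmoid is used here as one possible activation.

```python
import math


def sigmoid(z):
    """Sigmoid activation, used here as one possible choice of phi."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(total)


# Hypothetical example values (not taken from the thesis).
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.75

# The weighted sum is 0.5 - 2.0 + 0.75 + 0.75 = 0.0, so the sigmoid returns 0.5.
y = neuron_output(x, w, b, sigmoid)
print(y)
```

Swapping `sigmoid` for another activation function changes only the last step; the weighted sum itself stays the same.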


Figure 2.3: Composition of a hidden layer on the CNN

The activation function is used to decide whether a neuron should be active or not in the next layer. The most basic form of activation function is a binary step function, which produces 0 for inputs below a threshold and 1 for inputs above the threshold. The only information provided by this kind of activation function is whether the neuron is active or not. Non-linear functions like the sigmoid or tanh functions are other examples of activation functions. These functions normalize the output while also providing a smooth gradient between their maximum and minimum values. The Rectified Linear Unit (ReLU) function and the leaky ReLU function are two additional varieties of activation functions. The paragraphs below discuss the advantages and disadvantages of each of these functions, which are mostly related to training [1].
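The five activation functions named above can be sketched in a few lines of plain Python. The treatment of an input exactly equal to the threshold (mapped to 1 here) and the leaky ReLU slope of 0.01 are assumptions made for this illustration, not values given in the text.

```python
import math


def binary_step(z, threshold=0.0):
    """0 below the threshold, 1 at or above it (the tie rule is an assumption)."""
    return 1.0 if z >= threshold else 0.0


def sigmoid(z):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any real input into the range (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Passes positive inputs through and clamps negative inputs to 0."""
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    """Like ReLU, but keeps a small slope for negative inputs (slope is assumed)."""
    return z if z > 0 else slope * z


for z in (-2.0, 0.0, 2.0):
    print(z, binary_step(z), sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```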


Figure 2.4: Five commonly used activation functions: (a) binary step function, (b) sigmoid function, (c) tanh function, (d) ReLU function, (e) leaky ReLU function. [1]

The training phase and the inference phase, also known as the classification phase, are the two distinct stages that a neural network goes through when processing a particular type of data. These stages frequently alternate during unsupervised and reinforcement learning: the network performs training, inference, parameter updates, tests with additional inference, and so forth. In supervised learning, the training is always completed before any inference is performed. Because we expect the network to label similar data during inference, we feed it labeled data during the training phase [1].

A network's parameters, its set of weights and biases, are iteratively updated during training in order to produce the desired results. Backpropagation and a loss function are frequently used to accomplish this. The output values from the network are compared to the desired values using a loss function. Categorical crossentropy is a widely used loss function [1]:

\[ L = -\sum_{i} y_i \log(\hat{y}_i) \]

where $\hat{y}_i$ represents the value predicted by the network and $y_i$ the desired value. The next step is to minimize the loss function, which provides the best possible set of parameters. Gradient descent is a popular technique for minimizing the loss function, where the derivative of the loss function is first examined to determine how the parameters should be changed. The selection of the activation function is crucial at this point. The gradients of the sigmoid and tanh functions approach zero for very large or very small values, the gradient of the binary step function is zero, and the gradient of the ReLU function vanishes for negative numbers. In these regions, the network therefore cannot be trained by minimizing the loss function. To counter this, the leaky ReLU has a small gradient for negative numbers. Backpropagation is the entire procedure of taking the output of the loss function and updating the parameters back through the network [1].
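To make the role of the loss function concrete, the following sketch evaluates categorical crossentropy for a one-hot target and two hypothetical predictions. The helper name, the small `eps` guard against taking the logarithm of zero, and the example probabilities are assumptions for illustration only.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Negative sum of target * log(prediction) over all classes.

    eps guards against log(0); it is an implementation detail assumed here.
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# One-hot target: the sample belongs to the second of three classes.
y_true = [0.0, 1.0, 0.0]

# Two hypothetical network outputs (probabilities summing to 1).
confident = [0.1, 0.8, 0.1]
mistaken = [0.7, 0.2, 0.1]

loss_confident = categorical_crossentropy(y_true, confident)  # equals -ln(0.8)
loss_mistaken = categorical_crossentropy(y_true, mistaken)    # equals -ln(0.2)
print(loss_confident, loss_mistaken)
```

The prediction that puts more probability on the correct class receives the lower loss, which is the quantity that gradient descent then tries to reduce.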

The training phase has additional parameters that govern how the parameters W and B are updated. These parameters, which are referred to as hyperparameters, are set when the training process begins. The learning rate, batch size, and number of epochs are the hyperparameters that are most frequently mentioned. After gradient descent has determined in which direction to change a parameter, the learning rate sets the size of the change. If the learning rate is too low, each iteration updates the parameters in very small steps and the training takes longer. With an excessive learning rate, the parameters' values may diverge. Batching is frequently used to shorten training times: multiple outputs are processed at once, rather than the parameters being updated after each output. Thus, the batch size determines how many outputs are batched. One epoch is equal to one pass of the training dataset through the network, and the number of epochs simply controls how many times this is done [1].

The trained network's performance is determined by the hyperparameters. However, there are times when the network gets trained too well on a dataset, leading to longer inference times and lower accuracy. It is overly focused on that dataset, which means it might no longer be able to recognize general features. This is similar to a person learning the word "car" while only ever seeing a Ferrari sports car. Even though a Ford has four wheels and a steering wheel, the person may not immediately recognize it as a car if they later come across one. To that person, the two do not look alike, so the Ford is not a car. As previously mentioned, the hyperparameters can be altered to avoid the parameters being overfit. Dropout is a different technique that involves removing neurons from the network at random, effectively deactivating them. This constant modification of the network helps ensure that it is not overfitted [1].
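The three hyperparameters discussed above (learning rate, batch size and number of epochs) can be seen together in a small mini-batch gradient descent loop. This is only a sketch on a hypothetical toy problem, fitting a line y = w·x + b to four noise-free points with a squared-error loss; the function name `train` and all values are assumptions and are unrelated to the model trained in this thesis.

```python
def train(data, lr, batch_size, epochs):
    """Mini-batch gradient descent for y = w * x + b with a squared-error loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):                          # one epoch = one pass over the data
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]   # batch_size samples per update
            # Gradients of the mean squared error over the batch.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            # The learning rate sets the size of each update step.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Toy dataset generated from y = 2x + 1 (hypothetical, noise-free).
data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]

w, b = train(data, lr=0.05, batch_size=2, epochs=500)
print(w, b)  # close to 2 and 1
```

With a much smaller learning rate, the same number of epochs leaves the parameters far from their target, while a much larger one makes the updates overshoot and diverge, which matches the behaviour described above.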

It has been demonstrated that the neural network's performance is significantly impacted by the parameter datatype. 32-bit floating point (FP32) has traditionally dominated deep learning, but quantization techniques are now used to bring the precision down to FP16, INT8, or even binary. Quantization refers to reducing the number of bits used for the parameters. The number range that can be represented in FP32 is approximately ±3.4 × 10^38. Quantizing FP32 parameters into, say, INT8 entails mapping all FP32 parameter values onto the INT8 range, which is [-128, 127]. More aggressive quantization is also applied, down to ternary (-1, 0, 1) or binary (-1, 1) parameters. The precision decreases when parameters are quantized to a smaller datatype, but the bandwidth and memory usage are greatly reduced as a result. For the ternary and binary cases, convolutions and fully connected layers can be calculated using addition and subtraction, which requires less computing power [1].
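The quantization levels mentioned above can be illustrated with a short sketch. The symmetric linear scaling used for INT8, the ternary threshold, and the function names are assumptions chosen for simplicity; real quantization schemes differ in how they pick scales and thresholds.

```python
def quantize_int8(values):
    """Symmetric linear quantization onto the INT8 range (scheme assumed here)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def ternarize(values, threshold):
    """Map each value to -1, 0 or +1; values near zero become 0."""
    return [0 if abs(v) < threshold else (1 if v > 0 else -1) for v in values]


def binarize(values):
    """Map each value to -1 or +1 using its sign (zero maps to +1 by assumption)."""
    return [1 if v >= 0 else -1 for v in values]


# Hypothetical FP32 weights.
weights = [0.4, -1.0, 0.02, 0.25]

q, scale = quantize_int8(weights)
restored = [v * scale for v in q]   # dequantized approximation of the weights

print(q)                            # [51, -127, 3, 32]
print(ternarize(weights, 0.1))      # [1, -1, 0, 1]
print(binarize(weights))            # [1, -1, 1, 1]
```

Each dequantized value differs from the original weight by at most half a quantization step, which is the precision loss that is traded for lower memory and bandwidth.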

One of the many uses for neural networks is image processing, which is the main objective of this thesis. Image manipulation, augmentation, and data processing are all included in the field of image processing. Typical uses are classifying what an image represents, or locating objects within an image and subsequently classifying those objects. In image classification, a neural network is used to identify features in an image that together help determine what the image represents. In object detection, finding the relevant objects is the first step. Typically, this is accomplished by highlighting the region where relevant objects can be found, after which the region is examined as an image classification problem [1].

2.1.3 Convolutional Neural Network (CNN)

A convolutional neural network (CNN) is a type of neural network that searches through input data for features using convolution operations. Since a 2D convolution can be represented with matrix multiplication, as explained below, CNNs are popular networks when the data is in the form of an image. To classify an image into one of a number of predetermined outputs, they first identify minor features in the image and then combine these features to create a more comprehensive understanding of what the image represents [1].


Figure 2.5: Structure of a CNN. The data is run through several convolutional and pooling layers, learning features in the image. [1]

An image of a given width and height can be thought of as a 2D matrix, with each element denoting the value of the corresponding pixel. These numbers, which typically fall between 0 and 255, indicate the pixel's intensity. A single matrix is used to represent a grayscale image, with 0 denoting black, 255 denoting white, and shades of grey in between. Three matrices, one for red, one for green, and one for blue, are stacked on top of each other to create colored RGB images. This adds depth, the image's third dimension. The input to the CNN then becomes a 32x32x3 matrix for a 32x32 RGB image. The concept of CNNs comes from the convolution formula, in which one function is convolved with another [1]:

\[ f(x) * g(x) = \int_{-\infty}^{\infty} f(\tau)\, g(x - \tau)\, d\tau \]

In the case of a CNN, the two functions are the input matrix and the convolution filter. The convolution filter is also a 2D matrix, designed to find a specific feature. The filter outputs a feature map after sweeping the input matrix. In practice, a single convolutional layer scans the input data with a number of different feature-finding filters, resulting in a 3D stack of feature maps whose depth equals the number of filters rather than the depth of the input data.

\[ (I * K)_{xy} = \sum_{i=1}^{h} \sum_{j=1}^{w} K_{ij} \cdot I_{x+i-1,\, y+j-1} \]

The following figure shows an example of applying a convolutional filter to input data and the resulting output:

Figure 2.6: Convolutional filter size 3x3 sweeping through image size 4x4

Convolution uses several filters to increase the depth of the image while decreasing the width and height. The reduction in width and height can be avoided by using padding. With padding, zero-valued elements are added to the borders of the input matrix, making it possible to obtain a convolution output whose size is equal to or larger than that of the image. Changing the stride length of the filter as it sweeps across the input also alters the size of the output [10].
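The effect of filter size, padding and stride on the output size can be checked with a small sketch. The function below slides a filter over a zero-padded input without flipping it, which is how convolution layers in CNNs are commonly computed; the function name `conv2d` and the identity-style 3x3 filter are assumptions for illustration.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a zero-padded `image` and return the feature map."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])

    # Add `padding` rows and columns of zeros around the border.
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]

    # Output size follows from input size, filter size, padding and stride.
    out_h = (h + 2 * padding - kh) // stride + 1
    out_w = (w + 2 * padding - kw) // stride + 1

    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * padded[r * stride + i][c * stride + j]
            row.append(acc)
        out.append(row)
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

# Hypothetical 3x3 filter that simply picks the centre pixel of each window.
kernel = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]

print(conv2d(image, kernel))                       # 2x2: [[6, 7], [10, 11]]
print(conv2d(image, kernel, padding=1))            # 4x4: same as the input
print(conv2d(image, kernel, stride=2, padding=1))  # 2x2: [[1, 3], [9, 11]]
```

A 3x3 filter on a 4x4 image yields a 2x2 output without padding, a 4x4 output with padding 1, and a 2x2 output again when the stride is raised to 2.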


Figure 2.7: Output (darkgreen) from a convolutional filter

The pooling operation, also known as the pooling layer, is another type of operation. In the CNN structure illustrated in Figure 2.5, a three-channel picture is the network's input. The data is put through a number of convolutional and pooling layers that learn the image's features. Fully connected layers then classify the image and determine the appropriate label using the learned features. In the pooling layer, several pixels are grouped and reduced to a single pixel, which decreases the size of the image. Either the maximum value from the group is kept, which is known as max pooling, or the average is used, which is known as average pooling.
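A minimal sketch of max pooling and average pooling is shown below. It assumes non-overlapping square windows (stride equal to the window size), and the function name `pool2d` and the example feature map are hypothetical.

```python
def pool2d(feature_map, size, mode="max"):
    """Reduce each non-overlapping size x size window to a single value."""
    out = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                row.append(max(window))                 # max pooling
            else:
                row.append(sum(window) / len(window))   # average pooling
        out.append(row)
    return out


# Hypothetical 4x4 feature map.
feature_map = [[1, 3, 2, 4],
               [5, 7, 6, 8],
               [9, 2, 1, 0],
               [3, 4, 5, 6]]

max_pooled = pool2d(feature_map, 2, "max")
avg_pooled = pool2d(feature_map, 2, "avg")
print(max_pooled)  # [[7, 8], [9, 6]]
print(avg_pooled)  # [[4.0, 5.0], [4.5, 3.0]]
```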

Figure 2.8: Convolutional filter sweeping with padding size 1, stride of 2

After the convolution and pooling layers, the data is run through a flattening layer, which turns the 3D matrix into a 1D array for the following layers. Fully connected layers are made up of one-dimensional (1D) arrays in which each element is represented by a neuron. Every neuron in a fully connected layer is linked to every neuron in the previous layer. Applying the softmax function, which normalizes the 1D array values into a probability distribution, is the final step in a CNN. This provides us with a prediction that can be expressed as a confidence value for each class.
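The flattening and softmax steps can be sketched as follows. Subtracting the maximum score before exponentiating is a common numerical-stability measure assumed here, and the example values are hypothetical.

```python
import math


def flatten(volume):
    """Turn a 3D list (channels x rows x columns) into a 1D list."""
    return [value for channel in volume for row in channel for value in row]


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    highest = max(scores)                      # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-channel 2x2 feature volume.
volume = [[[1, 2], [3, 4]],
          [[5, 6], [7, 8]]]
flat = flatten(volume)
print(flat)  # [1, 2, 3, 4, 5, 6, 7, 8]

# Hypothetical raw scores from the last fully connected layer.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

The class with the highest raw score also receives the highest probability, so the predicted label is the index of the largest softmax output.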

Figure 2.9: Output (darkgreen) of max pooling (left) and average pooling (right)

A number of CNN architectures are frequently used or cited because of their influence on the field or their effectiveness. These include LeNet-5, which contains three convolutional layers followed by two fully connected layers and still serves as a template for stacking convolutional and fully connected layers, and AlexNet, which contains five convolutional layers and three fully connected layers and was among the first to use ReLU activation. VGG-16 went further with 16 layers in all. Inception-V1/V3/V4 uses blocks with multiple layers and stacks them, creating a network-in-a-network structure, true to its name. ResNet-50 belongs to one of the first network families to employ batch normalization, with variants using up to 152 layers. MobileNets are a collection of networks intended for TensorFlow mobile and embedded applications [10].

2.1.4 Binary Neural Network (BNN)

One of the new, prominent types of NN is the BNN. It is also called the Binarized Convolutional Neural Network because it functions in nearly the same way as a CNN, including the use of convolutional layers. The parameters in binary neural networks (BNNs) are quantized to a precision of a single bit, so each takes one of only two values. Because of difficulties during training, early BNNs were not strictly binary. Backpropagation could not be used, since gradient descent on a binary function cannot update the weights in small steps. These early BNNs either used different training techniques or started with real-valued parameters that were then quantized. Later work made it possible to combine gradient descent and backpropagation with binary values for weights and activations. BNNs benefit from reducing the dot product between weights and activations to bitwise operations: multiplying binary values is equivalent to the XNOR logical operation, which is easy to implement in hardware. Although BNNs typically have lower accuracy than their higher-precision counterparts, they are significantly less resource and memory intensive, and their inference times are much shorter [1].

\[
x^{b} = \mathrm{Sign}(x) =
\begin{cases}
+1 & \text{if } x > 0, \\
-1 & \text{otherwise.}
\end{cases}
\]
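The link between the sign function, the XNOR operation, and the dot product can be made concrete with a short sketch. It assumes that +1 is stored as bit 1 and -1 as bit 0, which is a common convention but is not stated in the text; the helper names `encode` and `xnor_dot` and the example values are hypothetical.

```python
def sign(x):
    """Binarize a real value to +1 or -1, following the Sign function above."""
    return 1 if x > 0 else -1


def encode(values):
    """Pack +1/-1 values into an integer, one bit each (+1 -> 1, -1 -> 0)."""
    word = 0
    for v in values:
        word = (word << 1) | (1 if v == 1 else 0)
    return word


def xnor_dot(a_bits, w_bits, n):
    """Dot product of two n-element +1/-1 vectors computed from their bit encodings."""
    mask = (1 << n) - 1
    # XNOR marks positions where both bits agree; each agreement contributes +1,
    # each disagreement contributes -1, so the dot product is 2 * matches - n.
    matches = bin(~(a_bits ^ w_bits) & mask).count("1")
    return 2 * matches - n


activations = [sign(x) for x in [0.7, -1.2, 0.1, -0.4]]  # [1, -1, 1, -1]
weights = [sign(w) for w in [0.3, -0.9, -0.5, -0.2]]     # [1, -1, -1, -1]

reference = sum(a * w for a, w in zip(activations, weights))
fast = xnor_dot(encode(activations), encode(weights), len(weights))
print(reference, fast)  # 2 2
```

The multiply-and-add loop and the XNOR-plus-popcount version give the same result, which is why the binarized dot product maps so well onto simple hardware logic.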

Compared to a conventional neural network, a binary neural network is better suited to edge devices such as MPSoCs. However, it does have drawbacks, the most notable being that the gradient of the sign function is zero almost everywhere, which complicates training. BNN models are also observed to be less accurate than conventional neural networks. Still, because they are lightweight and well suited to edge devices, BNNs are constantly being improved, and there are methods that help improve their accuracy. The most frequently cited BNN comes from the paper by Courbariaux et al., who were the first to present a fully binarized network. XNOR-Net enhanced the BNN by including a gain term to make up for the information lost during binarization. Many other highly cited networks, including DoReFa-Net, ABC-Net, and BNN+, focused on improving binary neural network training. eBNN demonstrates the use of BNNs in embedded systems and faster inference times [1].

2.2.1 Smart Parking

The proposed Smart Parking solution uses a heterogeneous computing platform to detect parking lot occupancy. The solution allows surveillance camera systems to be integrated to collect and process data for the Smart Parking system, as shown in the Smart Parking Solution figure.

Using Edge AI, the system processes the input from surveillance cameras directly on the edge node before transferring data to the server. For the purpose of this project, an AI detection model is implemented on the edge node to detect occupancy based on license plate detection [4].

Figure 2.10: Smart Parking Solution

2.2.2 Edge AI

Edge AI refers to the deployment of AI software on hardware throughout the physical world. It is called "edge AI" because the AI computation is done close to the user at the edge of the network, near where the data is located, rather than centrally in a cloud computing facility or private data center. Because the internet is accessible everywhere, the edge of the network can refer to any location. It might be a department store, factory, hospital, or one of the devices we see every day, such as traffic lights, robots, and phones [11].

With the help of developments in edge AI, robots and devices can now function with the "intelligence" of human cognition wherever they may be. Smart applications with AI capabilities can learn to carry out the same tasks under varying conditions, much like in real life. The development of neural networks and IoT devices brings a variety of advantages, including real-time insights, lower costs, increased privacy, high availability, and persistent improvement [11].

Figure 2.11: Edge AI Workflow

In edge AI deployments, the inference engine runs on some sort of computer or device in remote locations such as factories, hospitals, automobiles, satellites, and residences. When the AI runs into a difficulty, the problematic data is frequently sent to the cloud so that the original AI model can be trained further, after which it eventually replaces the inference engine at the edge. Thanks to this feedback loop, deployed edge AI models keep becoming more capable, which significantly improves model performance [11].

Computing operations can be sped up by recognizing and exploiting the particular strengths of the hardware. This is also true for machine learning applications, where achieving high-performance inference requires suitable hardware. Neural network applications take advantage of the fact that binary arithmetic and matrix multiplication can be significantly accelerated using parallel hardware [1].

2.3.1 FPGA and SoC

Field Programmable Gate Arrays (FPGAs) are semiconductor devices that can be programmed and reprogrammed for the desired functionality or application. Although they are typically less efficient than an ASIC for any given task, they have the advantage of being reprogrammable as a design changes. They were previously used mostly for ASIC prototyping or for lower-volume designs and products, but they are now preferred in a wide range of applications. FPGAs are currently used in a variety of sectors and industries, including consumer goods, aerospace, automotive, medicine, and data centers [1].

FPGAs consist of configurable logic blocks (CLBs) arranged in a matrix and connected by programmable interconnects. The CLBs implement the logic functions, and the interconnects determine how the CLBs are wired together. The number of logic blocks that fit in the physical area of an FPGA determines how large a design can be implemented on it. In addition to CLBs, I/O blocks are required to connect the FPGA to the outside world. FPGAs are considered parallel by nature because different parts of the FPGA can be programmed to carry out operations on the same clock cycle [1].

Figure 2.12: An FPGA block diagram

Lookup tables (LUTs) and flip-flops (FFs) are the two fundamental parts that make up the CLBs of an FPGA. The combinatorial logic is handled by LUTs, which are truth tables. Rather than having a predetermined number of ready-made logic gates, each LUT can be configured to function as any logic gate [39]. Flip-flops are binary registers that hold either a 1 or a 0 until the next clock edge arrives, preserving state between clock cycles. Without FFs there would be no way to keep track of statuses, state machines, or counters. Two other parts of an FPGA should be discussed in addition to the LUTs and FFs. The first is block RAM (BRAM), which is memory housed inside the FPGA. Although memory can also be located outside the FPGA, such as EPROM, SRAM, or SD cards, BRAM is used for data that needs to be accessed without leaving the FPGA through the I/O blocks. The second is DSP slices, which are prebuilt multiply-accumulate circuits that are typically used when implementing certain common operations in general-purpose logic would be too resource-intensive and complex [1].
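The idea that a LUT is a configurable truth table can be illustrated with a toy software model. The class `LUT2` below is hypothetical and models only a 2-input table; LUTs in actual FPGAs have more inputs, and this sketch says nothing about how a particular device implements them.

```python
class LUT2:
    """Toy model of a 2-input lookup table: four configuration bits form the truth table."""

    def __init__(self, truth_table):
        # truth_table[i] is the output for inputs (a, b) with index i = 2 * a + b.
        assert len(truth_table) == 4
        self.truth_table = list(truth_table)

    def __call__(self, a, b):
        # Evaluating the LUT is a plain table lookup, not a computation.
        return self.truth_table[2 * a + b]


# The same structure becomes a different gate purely through its configuration bits.
and_gate = LUT2([0, 0, 0, 1])
xor_gate = LUT2([0, 1, 1, 0])
xnor_gate = LUT2([1, 0, 0, 1])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_gate(a, b), xor_gate(a, b), xnor_gate(a, b))
```

Reprogramming an FPGA amounts, in part, to loading new truth-table contents of this kind, which is why one device can implement AND, XOR, XNOR, or any other gate in the same physical resource.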

FPGAs are programmed using hardware description languages (HDLs), most frequently VHDL or Verilog. An HDL description of a design's behavior is used to configure the FPGA with an implementation of that design. For designs that require a processor, a soft-core CPU has traditionally been implemented in a portion of the FPGA, with the remaining FPGA fabric used for other functions. More recently, FPGAs and CPUs have been combined into heterogeneous designs known as Systems on Chip (SoCs). These SoCs consist of two separate parts: the Processing System (PS), which houses the CPU, and the Programmable Logic (PL), which houses the FPGA [1].

Figure 2.13: Front view of Ultra96-V2

The hardware chosen for this thesis is an Ultra96-V2 board distributed by Avnet. It is built around a ZU3EG multiprocessor system on a chip (MPSoC). This chip has an UltraScale architecture and is part of the Zynq UltraScale+ family. The Ultra96-V2 is equipped with a dual-core ARM Cortex-R5 and a quad-core ARM Cortex-A53 processor, which together can run a full operating system. Compared to the fastest CPUs, the MPSoC's FPGA enables hardware acceleration of up to a factor of 20. As a result, Avnet suggests the board as being well suited to high-speed AI. The Ultra96-V2 also provides two USB 3.0 ports and 2 GB of low-power double data rate 4 (LPDDR4) RAM, both of which are necessary for fast image processing. A monitor can be connected to the board through a Mini DisplayPort (mDP), which allows the board to operate independently. The PS ZU3EG consists of

their CUDA parallel platform for GPUs, NVIDIA has advanced the use of GPUs in other fields [1].

The NVIDIA Jetson Nano is the GPU system tested in this thesis. It is a single-board computer in the style of the Raspberry Pi with a 128-core Maxwell GPU. It has a microSD card for storage, a 1.43 GHz ARM A57 CPU, and 4 GB of RAM. The board includes extras such as Ethernet connectors, pin connectors, HDMI and DisplayPort ports, and USB ports. The board can be powered through either a 5V/4A barrel connector or a 5V/2.5A micro USB port; the 5V/4A barrel connector was used for this work. Because it runs the full Ubuntu operating system, the Jetson Nano can function as a standalone computer. As a result, it can benefit from the frameworks, tools, and libraries available on Ubuntu to increase development efficiency. To get the best performance out of the Jetson Nano, the power mode must be set to maximum.

2.4.1 PyTorch

PyTorch was released in 2016, and a growing number of researchers are adopting it. It was developed by Facebook, which also developed Caffe2 (Convolutional Architecture for Fast Feature Embedding). Converting a PyTorch-defined model into a Caffe2 model is difficult. With this problem in mind, Facebook and Microsoft created the Open Neural Network Exchange (ONNX) in September 2017. Simply put, ONNX was created for model conversion between frameworks. In March 2018, Caffe2 and PyTorch were merged.

PyTorch makes it easy to construct very complex neural networks, and it has quickly gained popularity as a result. In research projects it gives TensorFlow stiff competition. The creators of PyTorch set out to build a highly imperative library that could quickly handle all numerical computation. Running and testing a portion of the code in real time used to be a significant problem for deep learning scientists, machine learning developers, and neural network debuggers. PyTorch enables them to run and test their code in real time, so they do not have to wait to see whether it works.

Figure 2.15: A PyTorch workflow

2.4.1.0.1 Advantages of PyTorch Kaggle competition finishers currently use it consistently. Like Python, PyTorch provides a straightforward interface with a simple-to-use API, and the framework is very easy to use and run on both Windows and Linux. PyTorch offers a hybrid front end that provides flexibility and ease of use in eager mode while transitioning to graph mode for speed, optimization, and functionality in the C++ runtime environment. Distributed training of neural network models is possible with PyTorch. With native support for peer-to-peer communication and asynchronous execution of collective operations from Python and C++, it offers optimized performance in both research and production.

Python serves as the foundation of PyTorch, and the two are closely integrated. PyTorch can be used together with well-known Python libraries and packages, including Cython and Numba. Its front end is written in Python and is Pythonic, meaning that the code follows common Python idioms rather than those of Java or C++. A robust ecosystem of tools and packages, created by a vibrant community of researchers and developers, is available for extending PyTorch and enabling development in fields such as computer vision and reinforcement learning. This ecosystem aids the development of deep learning neural networks that are flexible and quick to build.

2.4.2 Vivado and Vivado HLS

Xilinx provides a wide range of tools for use with their hardware. A user can write their own IP or use pre-existing IP to create a design, run simulations, produce an RTL design, infer constraints, and ultimately produce a bitstream that can be loaded on the intended hardware. The hardware is described in an HDL such as VHDL or Verilog. A synthesis tool takes these register-transfer level (RTL) descriptions and maps them onto the FPGA. The editions of the Vivado Design Suite HLx offer this functionality. Vivado's output is a bitstream that can be loaded onto the target [1].

Vivado HLS gives software developers the option to create accelerated applications using C or C++. Application programming interfaces (APIs) are used to build RTL IP and communicate with the hardware. The application is compiled with the g++ compiler and the v++ compiler, both of which are included in the Vitis core development kit, so that it can run on an x86 host. The kit also comes with an ARM compiler for cross-compiling the application to run on the embedded processor of a Xilinx device [1].

2.4.3 PYNQ and BNN-PYNQ

Xilinx's PYNQ project makes it possible to use Python on a Zynq SoC. The PYNQ image is a bootable Linux image that is written to the SD card of the board housing the SoC. It uses an Ubuntu rootfs as its file system and includes the required pynq Python packages. In conjunction with Jupyter Notebook, the board can be accessed through a web browser from a PC on the same LAN. PYNQ uses hardware libraries referred to as overlays for the hardware description on an SoC. The PL of the SoC cannot be used without an overlay that is appropriate for the project. Although many online projects come with overlays, custom HDL designs require the creation of an overlay specific to the project [1].

For PYNQ as well as for the Ultra96v2 board, Jupyter notebooks and overlays are available in the BNN-PYNQ GitHub repository maintained by Xilinx. The repository is based on the FINN paper [12], which describes the networks used. The notebooks contain numerous examples of how to use the provided overlays for performance tests. The source code and pip installation packages are available, and the repository also includes a script for rebuilding the hardware files. With this script, a user can open the Vivado and Vivado HLS projects of the overlays and modify them or use them as a starting point for a custom overlay [1].
