MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION

DESIGN START-STOP CIRCUIT THROUGH TRAFFIC DETECTION
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
GRADUATION PROJECT
Ho Chi Minh City, 24 December 2022
DESIGN START STOP CIRCUIT THROUGH TRAFFIC DETECTION
Students: Le Chan Pham ID: 18145045
Vo Huy Vu ID: 18145080
Advisor: Assoc. Prof. Do Van Dung
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
Advisor: Assoc. Prof. Do Van Dung
Ho Chi Minh City, 24 December 2022
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, December, 2022
GRADUATION PROJECT ASSIGNMENT
Student name: LE CHAN PHAM Student ID: 18145045
Student name: VO HUY VU Student ID: 18145080
Major: Automotive engineering technology
Advisor: Assoc. Prof. DO VAN DUNG Phone number: 0966879932
Date of assignment: October 2022 Date of submission: December 2022
1. Project title: Design Start-Stop circuit through object detection
2. Equipment: Laptop with GPU, HD camera, Arduino UNO
3. Content of the project: Research convolutional neural networks and the YOLO algorithm model, train a YOLO model, evaluate the results, and use the output to control an Arduino
4. Final product: Traffic light detection system working on a webcam, videos, and images
CHAIR OF THE PROGRAM                    ADVISOR
(Sign with full name)                   (Sign with full name)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, December, 2022
ADVISOR’S EVALUATION SHEET
Student name: LE CHAN PHAM Student ID: 18145045
Student name: VO HUY VU Student ID: 18145080
Major: Automotive engineering technology
Project title: Design Start Stop circuit through object detection
6 Mark: ……… (In words: )
Ho Chi Minh City, … month, … day, … year
ADVISOR
(Sign with full name)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, December, 2022
PRE-DEFENSE EVALUATION SHEET
Student name: LE CHAN PHAM Student ID: 18145045
Student name: VO HUY VU Student ID: 18145080
Major: Automotive engineering technology
Project title: Design Start Stop circuit through object detection
6 Mark: ……… (In words: )
Ho Chi Minh City, … month, … day, … year
REVIEWER
(Sign with full name)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, December, 2022
EVALUATION SHEET OF DEFENSE COMMITTEE MEMBER
Student name: LE CHAN PHAM Student ID: 18145045
Student name: VO HUY VU Student ID: 18145080
Major: Automotive engineering technology
Project title: Design Start Stop circuit through object detection
Name of Defense Committee Member: ………
6 Mark: ……… (In words: )
Ho Chi Minh City, … month, … day, … year
COMMITTEE MEMBER
(Sign with full name)
ACKNOWLEDGMENTS
Throughout our studies and the graduation process, our team was always cared for, guided, and assisted by teachers from the Faculty for High Quality Training, and received support and help from friends and colleagues.

First and foremost, we want to thank the Board of Directors of Ho Chi Minh City University of Technology and Education for providing excellent facilities, modern equipment, and a library system with a wide variety of documents, which makes it convenient for students to research information.

We would like to express our gratitude to our advisor, Assoc. Prof. Do Van Dung, for assisting and leading us in completing this project.

Because of the team's limited experience, this study may contain errors. We look forward to receiving feedback and advice from our professors to help us improve our report.

Sincerely, thank you!
Ho Chi Minh City, 24th December 2022
CONTENTS
DISCLAIMER
ACKNOWLEDGMENTS
CONTENTS
ABSTRACT
CHAPTER 1: INTRODUCTION
1.1 Reason for choosing the topic
1.2 Scope of research
1.3 Project structure
1.4 Thesis limitations
CHAPTER 2: FUNDAMENTALS
2.1 Overview of the traffic lights system
2.2 Overview of the Engine Start-Stop system
2.2.1 How does the engine start-stop system work?
2.2.2 What are the benefits of Stop-Start?
2.2.3 What are the downsides of Stop-Start?
2.3 Introduction to Deep Learning
2.3.1 What is Deep Learning?
2.3.2 The difference between Machine Learning and Deep Learning
2.3.3 Some neural networks in Deep Learning
2.4 Overview of Convolutional Neural Networks in image classification
2.4.1 What is a Convolutional Neural Network?
2.4.2 Convolutional Neural Network architecture
2.5 ResNet
2.6 How to detect an object
2.7 Introduction to some object detection algorithms
2.7.1 R-CNN (Region-based Convolutional Neural Networks)
2.7.2 Fast R-CNN
2.7.3 Faster R-CNN
2.7.4 SSD (Single Shot Multi-box Detector)
CHAPTER 3: YOLO ALGORITHM MODEL
3.1 What is YOLO?
3.2 YOLO algorithm model
3.2.1 Prediction output in YOLO
3.2.2 Anchor box
3.2.3 Multi-label image classification
3.2.4 Non-maximum suppression (NMS)
3.2.5 Intersection over Union (IoU)
3.2.6 YOLO network architecture
3.2.7 Loss function
3.2.8 Training YOLO
3.2.9 Mean Average Precision (mAP)
CHAPTER 4: DESIGN IDEAL ENGINE START-STOP SYSTEM MODEL AND ALTERNATIVE ENGINE START-STOP SYSTEM MODEL
4.1 Design of the ideal engine start-stop system model
4.1.1 Changes in the classic engine start-stop system model
4.1.2 Components of the ideal engine start-stop system model
4.1.3 Process of the ideal engine start-stop system model
4.2 Alternative engine start-stop system model
4.2.1 Model overview
4.2.2 Hardware
4.2.3 Software
CHAPTER 5: OPERATION RESULTS AND FUTURE DEVELOPMENT
5.1 Operation
5.1.1 Labeling images for training
5.1.2 Training YOLO on Google Colab
5.1.3 Operating YOLO on Windows
5.1.4 Connecting Arduino to Python
5.2 Final result
5.3 Conclusion
5.3.1 Strengths
5.3.2 Weaknesses
5.4 Future development
5.4.1 Hardware
5.4.2 Software
REFERENCES
LIST OF FIGURES AND TABLES
Figure 2.1 Diagram of the engine start-stop circuit
Figure 2.2 Engine Start-Stop button on a Mercedes
Figure 2.3 Comparison between Machine Learning and Deep Learning
Figure 2.4 The typical structure of an ANN
Figure 2.5 Perceptron
Figure 2.6 A looping constraint on the hidden layer of an ANN turns it into an RNN
Figure 2.7 Operation example of an RNN
Figure 2.8 Unrolled RNN
Figure 2.9 Output of convolution
Figure 2.10 CNN – image classification
Figure 2.11 Comparing the differences between ANN, RNN, and CNN
Figure 2.12 Layers in a CNN network
Figure 2.13 CNN network model – AlexNet
Figure 2.14 The image that the computer sees
Figure 2.15 Convolution between the input and a kernel to generate data for a hidden-layer neuron
Figure 2.16 Example of a convolutional layer
Figure 2.17 Graph of the Sigmoid function
Figure 2.18 Graph of ReLU
Figure 2.19 Graph of the Leaky ReLU function
Figure 2.20 Example of a pooling layer
Figure 2.21 Fully-connected layer
Figure 2.22 The relationship between network depth and performance
Figure 2.23 Residual Block model
Figure 2.24 Object detection in computer vision
Figure 2.25 Image processing diagram
Figure 2.26 R-CNN model
Figure 2.27 R-CNN
Figure 2.28 Fast R-CNN
Figure 2.29 Faster R-CNN model
Figure 2.30 SSD model
Figure 3.1 YOLO input image divided into a 7×7 grid
Figure 3.2 Example of calculating bounding box coordinates at 448×448 size
Figure 3.3 Output of YOLOv3
Figure 3.4 Output of YOLOv3
Figure 3.5 Anchor boxes solve the problem of detecting multiple objects that appear in the same output image area
Figure 3.6 Example of multi-object recognition (person and vehicle) appearing in the same area
Figure 3.7 YOLOv3 can detect objects with similar characteristics, such as "woman" and "person"
Figure 3.8 Ratio between area of overlap and area of union
Figure 3.9 General architecture of YOLO
Figure 3.10 YOLOv1 architecture
Figure 3.11 YOLOv2 architecture
Figure 3.12 YOLOv3 architecture
Figure 3.13 How the classification loss function works
Figure 3.14 Formula to estimate the bounding box from the anchor box
Figure 3.15 MS COCO object detection
Figure 3.16 An object detection model
Figure 3.17 Dense block layers
Figure 3.18 DenseNet
Figure 3.19 Cross-Stage-Partial connection
Figure 3.20 Object detection process
Figure 3.21 Applying SPP in YOLO (without DC block)
Figure 3.22 YOLO with SPP (with DC block)
Figure 3.23 Path Aggregation Network (PAN)
Figure 3.24 The design of the Neck
Figure 3.25 In YOLOv4, the researchers changed the add function to a concatenation function
Figure 3.26 In YOLOv4, the researchers changed the add function to a concatenation function
Figure 3.27 Spatial Attention Module
Figure 3.28 Convolutional Block Attention Module
Figure 3.29 YOLOv4-SAM
Figure 3.30 CutMix data augmentation
Figure 3.31 Mosaic data augmentation
Figure 3.32 DropBlock regularization
Figure 3.33 DropBlock algorithm
Figure 3.34 Class label smoothing
Figure 3.35 Mish activation
Figure 3.36 Output landscape comparison for Mish
Figure 3.37 Multi-input weighted residual connections
Figure 3.38 Depthwise Conv block
Figure 3.39 MobileNetV2 convolution
Figure 3.40 Inverted Residual Block
Figure 3.41 CmBN
Figure 3.42 Net layer in the cfg file
Figure 3.43 The [yolo] layers and the [convolutional] layers before them must be configured when you want to detect selected objects
Figure 3.44 Illustration of TP and FP
Figure 4.1 Classic Engine Start-Stop system
Figure 4.2 Ideal Engine Start-Stop system
Figure 4.3 HD camera
Figure 4.4 NVIDIA Jetson Nano A02
Figure 4.5 What's on the NVIDIA Jetson Nano
Figure 4.6 Battery VARTA AGM LN6 605901053 12V 105Ah
Figure 4.7 Ideal Start-Stop system workflow
Figure 4.8 Alternative Start-Stop workflow
Figure 4.9 HD camera
Figure 4.10 Laptop Dell G3 Gaming with NVIDIA GTX 1050 Ti GPU
Figure 4.11 Laptop specifications
Figure 4.12 Arduino UNO R3
Figure 5.1 Model diagram
Figure 5.2 Training data folder
Figure 5.3 Predefined-classes file
Figure 5.4 A label file of an image
Figure 5.5 Clone YOLOv7 from GitHub
Figure 5.6 Install necessary libraries
Figure 5.7 Trial detection with pretrained weights
Figure 5.8 Unzip training data from Drive
Figure 5.9 Reorganize the training data folder
Figure 5.10 Start training YOLO
Figure 5.11 Trial detection after training, with printed result
Figure 5.12 Open the YOLO path and import libraries
Figure 5.13 Start the detection model by running the detect.py file
Figure 5.14 Real-time detection with an image on a mobile phone
Figure 5.15 Upload example to Arduino
Figure 5.16 Connect Arduino to Python
Figure 5.17 Logic code for connecting Arduino
Figure 5.18 Detect an image from the internet
Figure 5.19 Detect an image from the internet
Figure 5.20 Detect a small, faraway object
Figure 5.21 Detection in low-brightness conditions
Figure 5.22 Detection with lens flare
Figure 5.23 Detect video
Figure 5.24 Detect video with an obstacle
Figure 5.25 Detect in real time
LIST OF ABBREVIATIONS
AI: Artificial Intelligence
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
ANN: Artificial Neural Network
IoT: Internet of Things
ResNet: Residual Neural Network
R-CNN: Region-based Convolutional Neural Network
Fast R-CNN: Fast Region-based Convolutional Neural Network
Faster R-CNN: Faster Region-based Convolutional Neural Network
SSD: Single Shot Multi-box Detector
YOLO: You Only Look Once
mAP: Mean Average Precision
IOU: Intersection Over Union
GPU: Graphics Processing Unit
CPU: Central Processing Unit
DNN: Deep Neural Network
CUDA: Compute Unified Device Architecture
cuDNN: NVIDIA CUDA Deep Neural Network library
ReLU: Rectified Linear Unit
RoI: Region of Interest
FPS: Frames Per Second
ABSTRACT
The world is witnessing rapid change driven by artificial intelligence. Automobile brands are investing millions of dollars in developing information technology, and thanks to object detection we can build many different automatic systems.

For that reason, we decided to improve the efficiency of the traditional engine start-stop system by adding a traffic light detection system. First, we had to learn about object detection algorithms, as well as the working principle of the engine start-stop system and its electric circuit. In this project, we develop an object detection model based on the YOLOv7 model and use the Python language for operation.
CHAPTER 1: INTRODUCTION
1.1 Reason for choosing the topic:
The working principle of traditional engine start-stop technology is that the engine stops once the brake pedal has been depressed for 2 seconds and runs again when the brake pedal is depressed again, which helps save energy. However, this trigger technology has two important disadvantages:
− When a vehicle stops at a red light for less than 5 seconds, the fuel consumed by reactivating the engine is more than the fuel the engine would have used idling through the red light.
− It only considers the vehicle status (stopping or running) and neglects the road status, especially road congestion, which leads to frequent start-stop activation and in turn affects both vehicle stability and driving comfort.

The main reason for the above disadvantages is the unintelligent trigger of the engine start-stop system. To solve this problem, this project combines the traditional engine start-stop system with traffic light detection using the YOLO algorithm model. The system can effectively improve the driving experience, reduce engine fuel consumption, and help promote traditional engine start-stop technology.

In recent years, the wave of artificial intelligence has been exploding and its applications are endless. AI can be applied in many fields such as healthcare, self-driving cars, smart homes, social media, and space exploration. However, applying AI in the real world requires not only high accuracy but also fast response speed. In object detection, many advanced models have been developed to address this, but most of them cannot be used in real time due to their large computational resource requirements.

For this reason, it is necessary to study computer vision. Nowadays, many countries have been applying computer vision to daily life, such as China's Skynet or BKAV's AI View camera for counting people, social distancing, and face recognition. Thus, it can be seen that using AI to exploit image data has become a trend.

Besides that, YOLO is one of the most advanced object detection models available today. YOLO is an end-to-end model that uses a single deep neural network to train, label, and determine the position of each object appearing in the frame (instead of using two neural networks and training each network separately to give the same predictions, like previous models). It can be said that YOLO built the first approach that makes object detection really feasible in everyday life. Based on the above statements and with the suggestion of our lecturer, we decided to choose the topic "Research object detection technology for vehicles by using Python" as the research topic for our graduation project.
1.2 Scope of research:
The thesis "Design engine start-stop circuit through traffic detection" is carried out with the following aims:
− Learn about the engine start-stop system.
− Learn about CNN networks and the YOLO algorithm.
− Learn how to use (train and detect with) YOLO and its application in different coding environments.
− Detect images and videos in real time using the OpenCV library and the Python programming language.
− Explore the potential of an engine start-stop system using object detection.
− Design an engine start-stop system with an object detection model.
− Simulate an engine start-stop system with an object detection model.
1.3 Project structure:
Chapter 1: Introduction
Chapter 2: Fundamentals
Chapter 3: YOLO algorithm model
Chapter 4: Design ideal engine start-stop model and alternative start-stop model
Chapter 5: Operation, results and future development
1.4 Thesis limitations:
In this project, our team had only 11 weeks to study, research, construct, improve, and develop the model. Because of the time limitation, we did not have enough time to design an ideal engine start-stop system and connect it to a real-life vehicle. In addition, a lack of equipment prevented us from building a complete start-stop circuit: the price of an Engine Start-Stop ECU is too high for us to afford, and the ECU is not available in school resources. Therefore, we are only able to export the digital signal to control an LED instead of sending it to the ECU. Our main goal in this thesis is thus to design, create, and simulate how an object detection system can connect and transmit a signal to the ECU, demonstrated by controlling an LED signal.
CHAPTER 2: FUNDAMENTALS
2.1 Overview of traffic lights system:
Traffic lights are signaling devices positioned at road intersections, pedestrian crossings, and other locations in order to control the flow of traffic. A traffic light normally consists of three signals, transmitting meaningful information to drivers and riders through colors and symbols. The regular traffic light colors are red, yellow, and green, arranged vertically or horizontally in that order. Although this is internationally standardized, variations exist on national and local scales in traffic light sequences and laws.

The method was first introduced in December 1868 on Parliament Square in London to reduce the need for police officers to control traffic. Since then, electricity and computerized control have advanced traffic light technology and increased intersection capacity. The system is also used for other purposes, for example, to control pedestrian movements, variable lane control (such as tidal flow systems or smart motorways), and railway level crossings.

A set of lights, known as a signal head, may have one, two, three, or more aspects. The most common signal type has three aspects facing the oncoming traffic: red on top, amber below, and green below that. Additional aspects may be fitted to the signal, usually to indicate specific restrictions or filter movements.
2.2 Overview of Engine Start Stop system:
A vehicle start-stop system (or stop-start system) automatically shuts down and restarts the internal combustion engine to reduce the amount of time the engine spends idling, thereby reducing fuel consumption and emissions. This is most advantageous for vehicles that spend significant amounts of time waiting at traffic lights or frequently come to a stop in traffic jams. Start-stop technology may become more common with more stringent government fuel economy and emissions regulations.
Figure 2.1 Diagram of the engine start-stop circuit
Figure 2.2 Engine Start-Stop button on a Mercedes
2.2.1 How does the engine start-stop system work?
The start-stop system detects when the car is stationary and, on the basis of sensors, determines a series of other factors about the operating mode of the vehicle. If the driver has stopped at a traffic light and sets the transmission to neutral, the start-stop system stops the engine (the fuel supply system and the engine ignition system temporarily stop working). In some more recent models, the engine even switches off when the speed falls below a certain value.

Although the engine, and therefore the primary source of power for all systems, is switched off, all of the electrical consumers and assistants are still supplied with power by the vehicle's battery. As soon as the clutch is actuated, the automatic start-stop system restarts the engine.

For vehicles with automatic or dual-clutch transmissions, the automatic start-stop system responds to actuation of the brake alone. If the vehicle is braked to a standstill and the driver's foot remains on the brake pedal, the automatic start-stop system stops the engine. When the brake is released, the automatic system starts the engine again.

At this time, the battery drives a current to the starter, which turns the flywheel; fuel continues to be pumped and the engine works again. This process takes less than 1 second.

This system is controlled by the ECU through a main relay. Parameters are received and calculated through the accelerator pedal position sensor, the speed sensor, and the brake pedal position sensor.

When the engine resumes, it drives the electrical systems back to normal operation, and the generator recharges the battery if the battery charge is below the specified level.
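The decision logic described above (for the automatic-transmission variant, which reacts to the brake pedal alone) can be sketched as a simple state machine. This is an illustrative sketch only, not actual ECU code; the class name, signal names, and conditions are assumptions for demonstration:

```python
class StartStopController:
    """Illustrative state machine for an automatic start-stop system
    (automatic-transmission variant: reacts to the brake pedal alone)."""

    def __init__(self):
        self.engine_running = True

    def update(self, speed_kmh, brake_pressed):
        # Stop the engine when the vehicle is at a standstill
        # and the driver keeps a foot on the brake pedal.
        if self.engine_running and speed_kmh == 0 and brake_pressed:
            self.engine_running = False   # fuel supply and ignition paused
        # Restart the engine as soon as the brake is released.
        elif not self.engine_running and not brake_pressed:
            self.engine_running = True    # battery drives the starter
        return self.engine_running
```

Calling `update(0, True)` while driving normally before would stop the engine, and a subsequent `update(0, False)` (brake released) would restart it, mirroring the sequence described in the text.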
2.2.2 What are the benefits of Stop-Start?
The idea behind the start-stop system is simple: if the engine is stopped for short periods, for example while waiting at traffic lights, fuel consumption and emissions are reduced. In this way, the automatic start-stop system helps to save fuel and protect the climate. With this technology, CO2 emissions can be reduced by 3–8%, and fuel consumption by about the same amount. The benefits to the environment and the improved efficiency have caused a rapid spread of automatic start-stop systems to all classes of vehicle. It was estimated that by 2020 this technology could save up to 1.6 billion gallons of fuel (1 gallon = 3.785 liters) and help reduce 8 tons of CO2 emissions at 14 locations around the world.

The main benefits are threefold. Firstly, it reduces pollution: an idling car creates pointless pollution, and by turning the engine off you will not be producing any at all; it also keeps the engine temperature from becoming too hot while the vehicle is stationary. Pollution is an increasing problem in many towns and cities, so every little reduction helps. Secondly, there is a fuel saving to be had. Granted, it is not a huge amount, but if much of your driving is in stop-start traffic it all adds up. A third, minor benefit is that it is quieter and more relaxing sitting in a car that is not thrumming away at idle.
2.2.3 What are the downsides of Stop-Start?
Sadly, it is not a perfect system, and there are some downsides. The primary one is that while the main intention of the device is to lower emissions, you have to wonder if we are robbing Peter to pay Paul: how much pollution is caused by the manufacture of the extra components required, and how much more waste is created at the vehicle's end of life?
Car manufacturers are being pushed to meet ever stricter emissions guidelines, and stop-start technology helps them achieve these targets, but there do not appear to be any studies that take into consideration the levels of pollution caused during production.

As stop-start places extra demand on components, you need specific, powerful batteries and more robust starters and engine mounts. While these should not have lifespans any shorter than those on a regular car, the cost of replacement can be substantially higher than on cars without stop-start, and the added complexity is likely to make labor charges higher on cars undergoing work in these areas.

But the main downside is that a lot of people simply do not like the sensation of their car automatically turning off, and manufacturers have identified that many owners just turn off the feature when they get in the car. It is something they are not used to and do not really understand or fully trust. Our advice, however, is to always leave stop-start engaged if your car is equipped with it.
2.3 Introduction to Deep Learning:
2.3.1 What is Deep Learning:
Deep learning is a subset of machine learning; it is essentially a neural network with three or more layers. These neural networks attempt to simulate the behavior of the human brain (albeit far from matching its ability), allowing them to "learn" from large amounts of data. While a neural network with a single layer can still make approximate predictions, additional hidden layers help to optimize and refine for accuracy.

Deep learning drives many artificial intelligence (AI) applications and services that improve automation, performing analytical and physical tasks without human intervention. Deep learning technology lies behind everyday products and services (such as digital assistants, voice-enabled TV remotes, and credit card fraud detection) as well as emerging technologies (such as self-driving cars).
2.3.2 The difference between Machine Learning and Deep Learning:

- Human intervention: Machine learning requires more ongoing human intervention to get results. Deep learning is more complex to set up but requires minimal intervention thereafter.
- Time: Machine learning systems can be set up and operated quickly, but may be limited in the power of their results. Deep learning systems take more time to set up but can generate results instantaneously.
- Approach: Machine learning requires structured data and uses traditional algorithms like linear regression. Deep learning employs neural networks and is built to accommodate large volumes of unstructured data.
- Applications: Machine learning is used in your email inbox, bank, etc. Deep learning enables more complex and autonomous programs, like self-driving cars or robots that perform advanced surgery.

Table 2.1 The difference between Machine Learning and Deep Learning
Figure 2.3 Comparison between Machine Learning and Deep Learning
2.3.3 Some neural network in Deep Learning:
2.3.3.1 Artificial Neural Network (ANN):
An ANN consists of 3 layers: input, hidden, and output. The input layer accepts the inputs, the hidden layer processes the inputs, and the output layer produces the result. Essentially, each layer tries to learn certain weights.
ANN can be used to solve problems related to:
- Tabular data
- Image data
- Text data
Figure 2.4 The typical structure of an ANN
One of the main reasons behind universal approximation is the activation function. Activation functions introduce nonlinear properties to the network. This helps the network learn any complex relationship between input and output.
For example, consider a perceptron with three inputs a, b, and c, and weights w1, w2, w3:

a × w1 = 0.7
b × w2 = 0.6
c × w3 = 1.4

where w1, w2, w3 are the weights. The weighted sum is 0.7 + 0.6 + 1.4 = 2.7, so the output of the neuron is y = a(x) = a(2.7 + bias), where a(·) denotes the activation function.
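The worked example above can be reproduced in a few lines of Python. The choice of sigmoid as the activation and a zero bias are illustrative assumptions, not part of the original example:

```python
import math

def sigmoid(x):
    # Classic S-shaped activation, squashing x into (0, 1)
    return 1 / (1 + math.exp(-x))

def perceptron(inputs, weights, bias, activation=sigmoid):
    # Weighted sum of inputs plus bias, then the nonlinear activation
    x = sum(i * w for i, w in zip(inputs, weights)) + bias
    return activation(x)

# With unit inputs, the weighted sum is 0.7 + 0.6 + 1.4 = 2.7,
# matching the example in the text.
y = perceptron([1.0, 1.0, 1.0], [0.7, 0.6, 1.4], bias=0.0)
```

Swapping `activation` for the identity function returns the raw weighted sum 2.7, showing that without a nonlinearity the neuron is purely linear.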
2.3.3.2 Recurrent Neural Network (RNN):
An RNN has a recurrent connection on the hidden state. This looping constraint ensures that sequential information is captured in the input data.
We can use recurrent neural networks to solve the problems related to:
- Time series data
- Text data
- Audio data
Figure 2.6 A looping constraint on the hidden layer of an ANN turns it into an RNN
An RNN captures the sequential information present in the input data, i.e., the dependency between the words in a text, while making predictions. As you can see in the figure above, the output (o1, o2, o3, o4) at each time step depends not only on the current word but also on the previous words.

RNNs share their parameters across different time steps. This is popularly known as parameter sharing. It results in fewer parameters to train and decreases the computational cost.
Figure 2.8 Unrolled RNN
As shown in the figure above, the three weight matrices U, W, and V are shared across all the time steps.
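A minimal sketch of one unrolled RNN makes the parameter sharing concrete: the same three matrices U, W, and V are reused at every time step. The dimensions and the tanh nonlinearity here are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3

# The only parameters of the network, shared across ALL time steps
U = rng.standard_normal((d_hidden, d_in))      # input  -> hidden
W = rng.standard_normal((d_hidden, d_hidden))  # hidden -> hidden
V = rng.standard_normal((d_out, d_hidden))     # hidden -> output

def rnn_forward(xs):
    h = np.zeros(d_hidden)
    outputs = []
    for x in xs:                      # one step per sequence element
        h = np.tanh(U @ x + W @ h)    # hidden state carries the past
        outputs.append(V @ h)         # o_t depends on x_t AND history
    return outputs

seq = [rng.standard_normal(d_in) for _ in range(5)]
outs = rnn_forward(seq)               # 5 outputs from only 3 matrices
```

However long the input sequence is, the parameter count stays fixed at the sizes of U, W, and V, which is exactly the saving the text describes.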
2.3.3.3 Convolutional Neural Network (CNN):
Convolutional neural networks (CNNs) are all the rage in the deep learning community right now. These CNN models are being used across different applications and domains, and they are especially prevalent in image and video processing projects.

The building blocks of CNNs are filters, a.k.a. kernels. Kernels are used to extract the relevant features from the input using the convolution operation. Let's try to grasp the importance of filters using images as input data. Convolving an image with filters results in a feature map:
Figure 2.9 Output of convolution
A CNN learns the filters automatically, without them being specified explicitly. These filters help in extracting the right and relevant features from the input data.
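The convolution of an image with a filter to produce a feature map can be sketched with NumPy. The image values and the vertical-edge kernel below are illustrative; this is a "valid" convolution (no padding, stride 1):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and take the element-wise
    product-and-sum at every position ('valid' convolution)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])   # simple vertical-edge filter
feature_map = conv2d(image, edge_kernel)  # 5x5 input -> 3x3 feature map
```

A 5×5 input convolved with a 3×3 kernel yields a 3×3 feature map, and in training a CNN it is the kernel values themselves that are learned.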
Figure 2.10 CNN – image classification
A CNN captures the spatial features of an image. Spatial features refer to the arrangement of pixels and the relationships between them in an image. They help us identify an object accurately, as well as its location and its relation to other objects in the image.
Figure 2.11 Comparing the differences between ANN, RNN, and CNN
2.4 Overview of Convolutional Neural Network in image classification:
2.4.1 What is Convolutional Neural Network?
A convolutional neural network (CNN) is a special kind of feedforward network. The CNN is the most popular and advanced deep learning model today: most current image recognition and processing systems use CNNs because of their fast processing speed and high accuracy. In a traditional neural network the layers are considered one-dimensional, while in a CNN the layers are three-dimensional: height, width, and depth. The CNN has two important concepts, the local receptive field and parameter sharing, which contribute to reducing the number of weights that need to be trained, thereby increasing the speed of computation.
Figure 2.12 Layers in a CNN network
CNN is typically composed of three types of layers (or building blocks):
- Convolution
- Pooling
- Fully connected layers
The fully connected layer is like those in regular neural networks, and the convolutional layer performs multiple convolutions on top of the previous layer. The pooling layer reduces the sample size per block of the previous layer. In CNNs, the network architecture typically stacks these three layer types to build the full architecture.
Figure 2.13 CNN network model – AlexNet
Computers see images differently than humans. The image seen by the computer is represented as an array containing the values of its pixels. This array can be a 2-D array for grayscale images or a 3-D array for RGB color images. The term tensor is used for arrays with dimension greater than or equal to 3: a 1-dimensional tensor is an array, and a 2-dimensional tensor is a matrix. A color image of size 512 × 512 pixels is a 3-dimensional tensor (512, 512, 3), where 3 represents the depth, i.e., the R, G, B color channels.

Usually, image processing involves a very large number of parameters. A tensor image (512, 512, 3) has 512 × 512 × 3 = 786,432 input values. For a neural network with 2 hidden layers of 16 neurons each and 2 neurons in the output, the number of parameters to be calculated would be 786,432 × 16 + 16 × 16 + 16 × 2 = 12,583,200. Building a good recognition model would need even more hidden layers and more neurons per layer, so the number of parameters would be larger still. A large number of computations can slow down the model and requires expensive, modern computers; in many cases, the amount of computation can exceed the ability of current computers.
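The arithmetic above can be checked directly (bias terms are omitted, as in the text):

```python
# Fully-connected parameter count for a 512x512 RGB image fed into
# two hidden layers of 16 neurons and an output layer of 2 neurons.
inputs = 512 * 512 * 3                     # 786,432 input values
params = inputs * 16 + 16 * 16 + 16 * 2    # weights per layer, summed
print(params)                              # 12,583,200 weights to train
```

Over twelve million weights for a tiny two-hidden-layer network is exactly the blow-up that motivates the local receptive field and parameter sharing described next.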
Figure 2.14 The image that the computer sees
2.4.2 Convolutional Neural Network Architecture:
2.4.2.1 Local Receptive Field:
Through several studies of image processing, researchers found that features in an image are often local, and pixels that are close together are often interconnected. Thus, the network architecture can be changed from a fully connected network to a locally connected one, reducing computational complexity. This is one of the main ideas behind CNNs.
Figure 2.15 Convolution between input and a kernel to generate data for a hidden layer neuron
A kernel, also known as a filter, is often used to extract features contained in an image. The kernel can be a matrix, or a 3-dimensional tensor if the input is a color image; the depth of the kernel matches the depth of the input. The kernel traverses the entire input and performs a scalar product (elementwise multiply-and-sum) over each region it passes through. These regions are called local receptive fields. The kernel moves over the input like the sliding-window technique in image processing, going from left to right, top to bottom. The result is a feature map containing the results of these scalar products. The depth of the output equals the number of kernels used in that convolutional layer. So a convolutional layer scans the entire input of that layer with its kernels and performs a scalar product on each region a kernel passes through; the end result is the output of the convolutional layer.
Figure 2.16 Example of a convolutional layer
The input to the convolutional layer above is an image of size 32×32×3. The layer uses 17 kernels of size 5×5×3 with stride 1. Each kernel contains 5×5×3 = 75 weights, which are adjusted during the learning process.

In general, for an input of size H×W×D, kernels of size k×k×D, N kernels, stride s, and padding p, the output of a convolutional layer has size:

((H − k + 2p) / s + 1) × ((W − k + 2p) / s + 1) × N

With stride 1 and no padding, the output here is 28×28×17. If, instead of sharing weights, every position the kernel visits across the 32×32 input had its own 5×5×3 weights for each of the 17 kernels, the layer would need 32×32 × (5×5×3) × 17 = 1,305,600 parameters. However, each neuron connects locally to a region of the image through the 5×5×3 kernel, and the same kernel scans the entire 32×32 input. So the actual number of parameters is:

(size of kernel) × (number of kernels) = (5×5×3) × 17 = 1,275

Thanks to CNN's ability to share parameters, the number of parameters to be calculated during training is significantly reduced.
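The output-size formula and the shared-parameter count can be sketched as:

```python
# Output size and parameter count of a convolutional layer, following the
# formula in the text: out = (in - k + 2p) / s + 1 per spatial dimension.

def conv_output_size(h, w, k, n_kernels, stride=1, padding=0):
    """Spatial output size (out_h, out_w, depth) of a conv layer."""
    out_h = (h - k + 2 * padding) // stride + 1
    out_w = (w - k + 2 * padding) // stride + 1
    return out_h, out_w, n_kernels

def conv_param_count(k, depth, n_kernels):
    """Shared weights: one k x k x depth kernel per output channel."""
    return k * k * depth * n_kernels

# The 32x32x3 input with 17 kernels of size 5x5x3, stride 1, from the text:
print(conv_output_size(32, 32, 5, 17))   # (28, 28, 17)
print(conv_param_count(5, 3, 17))        # 1275
```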
2.4.2.3 Activation Function:
In a general neural network, the activation function provides the nonlinear component at the output of the neurons. In classification and recognition problems, the data points are discrete. Without nonlinear activation functions, a neural network of even multiple layers is still only as expressive as a single linear layer, which makes it inapplicable to classification or recognition problems.
Suppose the input is X, the output is Y, and the weights are W. In the first layer, the weighted-sum function gives:
Z1 = W1·X
Z1 is then passed into a linear "activation" g(x) = cx, where c is a real number:
a1 = g(Z1) = c·Z1
Similarly, the output a1 of the first layer is the input of the second layer:
Z2 = W2·a1 = W2·c·Z1 = W2·c·W1·X
So the two layers collapse into a single linear transformation of X, which is why a nonlinear activation is needed.
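The collapse of two layers with a linear activation into one linear map can be checked numerically (the layer sizes below are arbitrary, chosen only for the demonstration):

```python
# Demonstration that stacking layers with a linear activation g(z) = c*z
# collapses to a single linear map: W2 @ (c * (W1 @ x)) == (c * W2 @ W1) @ x.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8))   # first layer weights
W2 = rng.standard_normal((4, 16))   # second layer weights
x = rng.standard_normal(8)          # input vector
c = 0.5                             # linear "activation" g(z) = c * z

two_layer = W2 @ (c * (W1 @ x))     # forward pass through both layers
collapsed = (c * W2 @ W1) @ x       # equivalent single linear layer

print(np.allclose(two_layer, collapsed))  # True
```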
Sigmoid function
Figure 2.17 Graph of Sigmoid function
The Sigmoid function takes a real number as input and maps it to a value in the range (0, 1). A very small negative input gives an output asymptotic to 0; conversely, a large positive input gives an output asymptotic to 1. In the past, the Sigmoid function was often used because it has a very convenient derivative. However, at present, the Sigmoid function is rarely used, because of the following disadvantages:
- The Sigmoid function saturates and kills gradients: a noticeable disadvantage is that when the input has a large absolute value (negative or positive), the gradient of this function is very close to 0. This means the corresponding weights remain almost unchanged during training (the Vanishing Gradient phenomenon).
- The Sigmoid function is not zero-centered, which makes convergence more difficult.
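The saturation described above can be seen directly from the Sigmoid's derivative, sigmoid'(x) = sigmoid(x)·(1 − sigmoid(x)):

```python
# Sigmoid and its derivative, illustrating saturation: for large |x| the
# gradient sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) is nearly zero.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(round(sigmoid(0.0), 4))       # 0.5
print(round(sigmoid_grad(0.0), 4))  # 0.25  (the maximum gradient)
print(sigmoid_grad(10.0) < 1e-4)    # True  (saturated: gradient vanishes)
```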
ReLU (Rectified Linear Unit) function
Figure 2.18 Graph of ReLU
The ReLU function has been used widely in recent years when training neural networks. ReLU simply clamps values less than 0 to 0. The advantages of ReLU are:
- Much faster convergence: ReLU converges about 6 times faster than Sigmoid, probably because ReLU does not suffer from the Vanishing Gradient problem the way Sigmoid does.
- Faster computation: Sigmoid uses the exponential function and its formula is much more complex than ReLU's, so it costs more to compute.
Leaky ReLU function:
Figure 2.19 Graph of Leaky ReLU function
Leaky ReLU is an attempt to eliminate the "dying ReLU" problem. Instead of returning zero for inputs less than zero, Leaky ReLU returns a slightly sloped line. Many reports find Leaky ReLU more effective than ReLU, but this effect is not clear and consistent.
In addition to Leaky ReLU, there is a well-known variant of ReLU called PReLU. PReLU is similar to Leaky ReLU but allows the neuron to automatically learn the best α coefficient.
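The three ReLU variants differ only in how they treat negative inputs; a minimal sketch (the α values here are common defaults, assumed for illustration):

```python
# ReLU, Leaky ReLU, and PReLU. In PReLU, alpha is a learned parameter;
# here it is passed in as a fixed value just for the demonstration.

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def prelu(x, alpha):
    # Same shape as Leaky ReLU, but alpha is learned during training.
    return x if x > 0 else alpha * x

print(relu(-3.0))                      # 0.0
print(round(leaky_relu(-3.0), 4))      # -0.03
print(prelu(-3.0, 0.25))               # -0.75
```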
2.4.2.4 Pooling layer:
The pooling layer reduces the size of the image immediately after a convolution, helping to retain the most prominent features and properties of the image. This reduces the amount of computation when the image is large, while not losing the image's important features.
Although locally connected networks and shared parameters are used, the number of parameters in the neural network is still large; compared to a relatively small data set, this can cause overfitting. Therefore, artificial neural networks often insert pooling layers into the network. The pooling layer gradually reduces the number of parameters to improve computation time. It applies downsampling to the previous layer, typically using the max function, and operates independently on each depth slice of the previous layer. As with the convolutional layer, it is also possible to set the number of pixels the window moves each step, i.e. a stride, for example equal to 2.
Figure 2.20 Example of pooling layer
In the above example, the kernel size is 2×2 and the stride is 2. In each window, the max function takes the maximum value to represent that region in the next layer. There are two types of pooling: if the kernel size equals the stride, it is traditional pooling; if the sliding window is larger than the stride, it is overlapping pooling. In practice, neural networks often use a 2×2 kernel with a stride of 2, or a 3×3 kernel with a stride of 2, because a larger window very easily loses the characteristics of the data. Besides the max function, other functions can be used for pooling; for example, taking the average of the window to compute the value for the next layer is called average pooling.
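The 2×2, stride-2 max pooling described above can be sketched in pure Python (the 4×4 input values are made up for the example):

```python
# 2x2 max pooling with stride 2 on a small grayscale image, as in the
# example above.

def max_pool(image, k=2, stride=2):
    """Max pooling over a 2-D list of pixel values."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - k + 1, stride):
        row = []
        for j in range(0, w - k + 1, stride):
            # Take the maximum value inside each k x k window.
            window = [image[i + di][j + dj]
                      for di in range(k) for dj in range(k)]
            row.append(max(window))
        out.append(row)
    return out

image = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 0, 8],
]
print(max_pool(image))  # [[6, 4], [7, 9]]
```

Replacing `max(window)` with `sum(window) / len(window)` would give average pooling.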
2.4.2.5 Fully-Connected Layer:
The third layer type in a CNN network is the fully connected layer. This layer is like a traditional neural network: the neurons in the previous layer connect to neurons in the next layer, and the last layer is the output. To feed in the feature maps from the previous layers, the data must first be flattened into a one-dimensional vector. Finally, a SoftMax function is used to perform object classification.
Figure 2.21 Fully-Connected Layer
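The final flatten-and-SoftMax step can be sketched as follows (the score values are arbitrary, for illustration only):

```python
# Flattening nested feature maps into a vector and applying SoftMax to
# turn raw class scores into probabilities, as described above.
import math

def flatten(feature_maps):
    """Turn nested lists (e.g. H x W x D feature maps) into a flat vector."""
    flat = []
    for item in feature_maps:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(flatten([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]))  # [1, 2, 3, 4, 5, 6, 7, 8]
probs = softmax([2.0, 1.0, 0.1])
print(round(sum(probs), 6))      # 1.0
print(probs.index(max(probs)))   # 0  (the class with the highest score)
```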
2.5 ResNet:
ResNet (short for Residual Network) is a deep learning network with a CNN architecture that attracted attention after the ILSVRC 2015 competition and became popular in the field of machine vision. ResNet makes it possible and efficient to train networks with hundreds or even thousands of layers.
Since AlexNet, CNN architectures have become deeper and deeper. While AlexNet has only 5 convolutional layers, the VGG and GoogleNet (aka Inception_v1) networks have 19 and 22 layers respectively. However, increasing network depth is more than simply stacking layers together. Deep networks are difficult to train because of the vanishing gradient problem: as the gradient is propagated back to earlier layers, repeated multiplication can make it extremely small. As a result, the network's performance degrades rapidly.
Figure 2.22 The relationship between network depth and performance
The main idea of ResNet is to use an identity shortcut connection that skips one or more layers. Such a block is called a Residual Block, as shown in the following
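The residual block computes y = F(x) + x, where F(x) is the output of the skipped weight layers and x passes through the shortcut unchanged. A minimal sketch, with tiny hypothetical elementwise "layers" standing in for real convolutions:

```python
# Sketch of a residual block: output = F(x) + x. The identity shortcut
# guarantees the input is passed through even if F learns to output ~0,
# which is what eases gradient flow in very deep networks.
# The elementwise-scale "layers" here are hypothetical, for illustration.

def relu_vec(v):
    return [max(0.0, x) for x in v]

def residual_block(x, layer1, layer2):
    """y = layer2(relu(layer1(x))) + x, with elementwise-scale layers."""
    hidden = relu_vec([w * v for w, v in zip(layer1, x)])
    fx = [w * v for w, v in zip(layer2, hidden)]
    return [f + v for f, v in zip(fx, x)]   # add the identity shortcut

# Even if both layers output zeros, the block still passes x through:
x = [1.0, -2.0]
print(residual_block(x, [0.0, 0.0], [0.0, 0.0]))  # [1.0, -2.0]
```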