MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
TRƯƠNG QUANG PHÚC, M.Eng PHẠM MINH QUÂN
NGUYỄN HOÀI PHƯƠNG UYÊN
DESIGN AND IMPLEMENTATION OF
CLASSIFICATION AND DELIVERY BASED
ON COMPUTER VISION
SKL 009697
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND
EDUCATION FACULTY FOR HIGH QUALITY TRAINING
Student ID: 18119053
Major: COMPUTER ENGINEERING TECHNOLOGY
Ho Chi Minh City, December 2022
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND
EDUCATION FACULTY FOR HIGH QUALITY TRAINING
GRADUATION PROJECT
DESIGN AND IMPLEMENTATION OF CLASSIFICATION AND DELIVERY BASED ON
COMPUTER VISION
PHẠM MINH QUÂN
Student ID: 18161031
NGUYỄN HOÀI PHƯƠNG UYÊN
Student ID: 18119053
Major: COMPUTER ENGINEERING TECHNOLOGY
Advisor: TRƯƠNG QUANG PHÚC, M.Eng
Ho Chi Minh City, December 2022
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, December 25, 2022
GRADUATION PROJECT ASSIGNMENT
Student name: Phạm Minh Quân Student ID: 18161031
Student name: Nguyễn Hoài Phương Uyên Student ID: 18119053
Major: Computer Engineering Technology Class: 18119CLA
Advisor: Trương Quang Phúc, M.Eng Phone number: _ Date of assignment:
_
Date of submission: _
1 Project title: Design and Implementation of classification and delivery based on Computer Vision
2 Initial materials provided by the advisor: _
3 Content of the project: _
4 Final product:
CHAIR OF THE PROGRAM
(Sign with full name)
ADVISOR
(Sign with full name)
Trương Quang Phúc
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, December 25, 2022
ADVISOR’S EVALUATION SHEET
Student name: Phạm Minh Quân Student ID: 18161031
Student name: Nguyễn Hoài Phương Uyên Student ID: 18119053
Major: Computer Engineering Technology
Project title: Design and Implementation of classification and delivery based on Computer Vision
Advisor: Trương Quang Phúc, M.Eng
EVALUATION
1 Content of the project:
2 Strengths:
3 Weaknesses:
4 Approval for oral defense? (Approved or denied)
Approved
5 Overall evaluation: (Excellent, Good, Fair, Poor)
Good
6 Mark: 9.0 (in words: .)
Ho Chi Minh City, December 25, 2022
ADVISOR
(Sign with full name)
Trương Quang Phúc
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, January 13, 2023
MODIFYING EXPLANATION OF THE GRADUATION PROJECT
MAJOR: COMPUTER ENGINEERING
1 Project title: Design and Implementation of classification and delivery based on Computer
Vision
2 Student name: Phạm Minh Quân Student ID: 18161031
Student name: Nguyễn Hoài Phương Uyên Student ID: 18119053
3 Advisor: Trương Quang Phúc, M.Eng
4 Defending Council: Council 2, Room: A3-404, 3rd January 2023
5 Modifying explanation of the graduation project:
No. | Council comments | Editing results
1 | Many figures in chapter 2 are reused from other sources without the related references. | The related references have been provided for the figures in chapter 2.
2 | The visual quality of many figures in chapter 3 is very low and hard to follow. | The figures in chapter 3 have been modified to improve their visual quality.
3 | In the conclusion, the authors should clearly point out which objectives have been accomplished instead of giving a general summarization. | The conclusion section now clearly points out which objectives have been accomplished.
4 | The flowchart of figure 3.6 must have a "Begin" point and an "End" point in a terminator shape. | The flowchart of figure 3.6 has been modified accordingly.
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, December 25, 2022
PRE-DEFENSE EVALUATION SHEET
Student name: Phạm Minh Quân Student ID: 18161031
Student name: Nguyễn Hoài Phương Uyên Student ID: 18119053
Major: Computer Engineering Technology
Project title:
Name of Reviewer:
EVALUATION
1 Content and workload of the project
2 Strengths:
3 Weaknesses:
4 Approval for oral defense? (Approved or denied)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
EVALUATION SHEET OF DEFENSE COMMITTEE MEMBER
Student name: Phạm Minh Quân Student ID: 18161031
Student name: Nguyễn Hoài Phương Uyên Student ID: 18119053
Major: Computer Engineering Technology
Project title:
Name of Defense Committee Member:
EVALUATION
1 Content and workload of the project
2 Strengths:
3 Weaknesses:
4 Overall evaluation: (Excellent, Good, Fair, Poor)
SUPERVISOR APPROVAL
ACKNOWLEDGEMENTS
Over the course of this project, our group received plenty of valuable support that encouraged us to overcome the problems and challenges and to complete this hard but meaningful project.
Firstly, we would like to thank the School Board of the Ho Chi Minh City University of Technology and Education and the Faculty for High Quality Training for creating wonderful conditions for us to carry out our project.
Secondly, we sincerely thank Mr. Trương Quang Phúc, our advisor, who gave us useful guidance and instruction that helped us finish the project successfully. From his advice we were able to improve the project's content and correct our mistakes.
Thirdly, we are grateful to our classmates of class 18119CLA, who contributed advice and warm guidance whenever we needed support.
Last but not least, due to limited knowledge and implementation time, we cannot avoid errors; we look forward to receiving your comments and suggestions to improve this topic.
In short, we truly thank all the people who are part of our achievement.
Ho Chi Minh City, December 23, 2022
Students: Pham Minh Quan, Nguyen Hoai Phuong Uyen
ABSTRACT
Currently, industry both in Vietnam and internationally is growing quickly, and manufacturers are interested in combining industry with automation. With the advancement of digital technology, automatic lines are becoming more widely used in manufacturing. Manufacturing companies are constantly improving their technology and machinery systems in order to produce high-quality products at the most competitive prices. That is the foundation for improving their competitive position and helping businesses stand firm in a competitive market. This report delves deeper into the role of automation in modern manufacturing.
Automation benefits the logistics industry in particular, providing increased productivity, lower operating costs, improved product quality, and lower raw material costs.
After absorbing information and researching the automation industry, our team decided to implement the topic "Automatic parcel classification and delivery through image processing". We use a convolutional neural network (CNN), one of the most widely used networks in the field, for character recognition and barcode recognition. We chose a convolutional neural network because its deep architecture and large number of parameters are good enough for object recognition. We also prepared a sizable data set, including some particular examples, so that the training procedure yields good results. To optimize data inference and training time, we additionally chose NVIDIA's Jetson Nano hardware to exploit the GPU's processing capability. JPG files with the collected data are used for both testing and training.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
CHAPTER 1: OVERVIEW
1.1 Introduction
1.2 Objective
1.3 Limitation
1.4 Research Method
1.5 Object and Scope of the Study
1.5.1 Object of the Study
1.5.2 Scope of the Study
1.6 Outline
CHAPTER 2: BACKGROUND
2.1 AI Technology
2.1.1 Overview of CNN
2.1.2 Yolo Network
2.1.3 Yolov7
2.1.4 OCR Theory
2.1.5 Tesseract Model
2.2 Barcode Technology
2.2.1 Introduction to Barcodes
2.2.2 Barcode Types
2.2.3 The Methods of Barcode Scanning
2.2.4 Code 128
2.3 The Overview of AGV
2.3.1 The Introduction of AGV
2.3.2 The Fundamental Architecture of an AGV System
2.4 PyQt5 Platform
2.5 Firebase
2.5.1 Introduction to Firebase
2.5.2 Some Features of Firebase
2.5.3 The Pros and Cons of Firebase
2.6 Other Techniques Used in the Project
2.6.1 The Working Principle of the Infrared Sensor Circuit in the Vehicle's Line Detector
2.6.2 Pulse Width Modulation (PWM)
2.6.3 General Operating Principles of the Automatic Traction Robot
2.6.4 The Method for Establishing the Robot's Location in Relation to the Line
2.6.5 Serial Peripheral Interface (SPI)
CHAPTER 3: DESIGN AND IMPLEMENTATION
3.1 System Requirements
3.2 Block Diagram
3.3 AI System
3.3.1 Hardware Design
3.3.2 Detailed Software Design
3.4 AGV System
3.4.1 Mechanical Design
3.4.2 Detailed Hardware Design
3.4.3 The Schematic Diagram of the AGV System
3.4.4 Software Design
3.5 User Interface of the Delivery Application
3.6 Firebase Realtime Database
CHAPTER 4: RESULTS
4.1 Introduction
4.2 Hardware Implementation
4.3 System Operation
4.4 Software System
4.4.1 The Barcode Generation Process
4.4.2 Result of the Labelling Process
4.4.3 Annotations for Data
4.4.4 Training Process
4.4.5 The Text Detection Process with the Tesseract OCR Model
4.4.6 Interface of the Delivery App
4.5 Evaluation and Comparison
4.5.1 Comparison between Yolov7 and Other Yolo Versions
4.5.2 Comparison between Yolo and Other CNNs
CHAPTER 5: CONCLUSION AND FUTURE WORK
5.1 Conclusion
5.2 Future Work
APPENDIX
REFERENCES
LIST OF FIGURES
Figure 2.1: Architecture of CNN network
Figure 2.2: Sliding the kernel through the input matrix
Figure 2.3: Architecture of Yolo network
Figure 2.4: Identify anchor box of an object
Figure 2.5: The active way of YOLO
Figure 2.6: Architecture of E-ELAN
Figure 2.7: Architecture of Compound Model Scaling in YOLOv7
Figure 2.8: Architecture of Planned Re-parameterized Convolution
Figure 2.9: Architecture of Coarse for Auxiliary and Fine for Lead Loss
Figure 2.10: CCD scanner
Figure 2.11: Laser scanner
Figure 2.12: Reading barcodes with camera software
Figure 2.13: Code 128
Figure 2.14: The parts of a Code 128 barcode
Figure 2.15: AGV vehicle operation diagram
Figure 2.16: The basic structure of an AGV system
Figure 2.17: Towing type
Figure 2.18: Cargo type
Figure 2.19: Forklift
Figure 2.20: Interface of Qt Designer software
Figure 2.21: Image of QMainWindow
Figure 2.22: Firebase database
Figure 2.23: Working principle
Figure 2.24: Principle diagram of infrared sensor
Figure 2.25: Time diagram of the PWM pulse
Figure 2.26: General structure of line detection robot
Figure 2.27: The robot is in the middle of the line
Figure 2.28: The robot is moving to the right, level 1
Figure 2.29: The robot is moving to the right, level 2
Figure 2.30: Robot turns left
Figure 2.31: Sensor deviating from line
Figure 2.32: Communication between 1 master and 1 slave
Figure 2.33: Independent mode in SPI protocol
Figure 2.34: Daisy-chain mode in SPI protocol
Figure 2.35: SPI protocol operation modes
Figure 2.36: The communication process between master and slave using the SPI protocol
Figure 3.1: Block diagram of automatic classification and transportation system
Figure 3.2: The detailed block diagram of the AI system
Figure 3.3: Top view of Jetson Nano
Figure 3.4: Top view of Logitech C310 HD Webcam
Figure 3.5: The pipeline of the AI system
Figure 3.6: Flowchart of generating a barcode
Figure 3.7: Pipeline of pre-processing images
Figure 3.8: Format to export dataset
Figure 3.9: Pipeline of training data using the Yolov7 model
Figure 3.10: Command line to train the dataset on Google Colab
Figure 3.11: Command line to test an image from the dataset
Figure 3.12: Pipeline of Tesseract processing
Figure 3.13: The flowchart of detecting barcodes
Figure 3.14: Top view of a V1 reducer wheel
Figure 3.15: Coordinate system of the AGV robot
Figure 3.16: Model of forces acting on the wheel
Figure 3.17: Dual Shaft Plastic Geared TT Motor
Figure 3.18: Calculation model and force analysis when the vehicle is cornering
Figure 3.19: The 3D design of the AGV robot
Figure 3.20: The block diagram of the AGV system
Figure 3.21: The top view of ESP32
Figure 3.22: RFID MFRC522 module
Figure 3.23: Line Detection and Obstacle Avoidance Sensor 5 LED BFD-1000
Figure 3.24: Top view of the SG90 servo motor
Figure 3.25: The detailed schematic for the buzzer block
Figure 3.26: Top view of the L298N motor driver module
Figure 3.27: Top view of V1 geared DC motor
Figure 3.28: Top view of Lithium-ion 18650 battery
Figure 3.29: Top view of LM2586HVS 3A DC-DC step-down buck converter
Figure 3.30: The schematic of the power supply block
Figure 3.31: The schematic diagram of the AGV system
Figure 3.32: The operating diagram of the AGV system
Figure 3.33: The flowchart of the main program
Figure 3.34: The operating diagram of the delivery application
Figure 3.35: The database of the system
Figure 4.1: An automatic parcel classification and delivery system model
Figure 4.2: AI system model
Figure 4.3: Model of the AGV vehicle system
Figure 4.4: Line structure
Figure 4.5: The shape of the cargo block
Figure 4.6: AI system successfully recognizes and predicts barcodes
Figure 4.7: AGV vehicle at the delivery location
Figure 4.8: AGV vehicle returns to the starting position
Figure 4.9: Result of generating the barcode folder
Figure 4.10: Result of each barcode in Code 128 format
Figure 4.11: Result of labelling each image
Figure 4.12: Result of the annotation process
Figure 4.13: Successful Yolov7 dataset training results
Figure 4.14: The model's performance when tested with the input image
Figure 4.15: Accuracy of the input image after the testing process
Figure 4.16: The interface of the delivery app
Figure 4.17: Comparison of mAP and FPS between Yolov7 and Yolov5/Yolov6 on CPU
Figure 4.18: Comparison of mAP and FPS between Yolov7 and Yolov5/Yolov6 on GPU
Figure 4.19: Comparison of mAP and FPS between Yolov7 and Yolov5/Yolov6 on a TESLA P100
Figure 4.20: Comparison of AP and inference time between Yolov7 models and other Yolo versions' models
LIST OF TABLES
Table 2.1: The types of 1D barcodes used in industry
Table 2.2: The pros and cons of Firebase
Table 3.1: Specifications of the Dual Shaft Plastic Geared TT Motor
Table 3.2: Power consumption estimate of the AGV system
Table 4.1: Comparison of mAP and FPS between Yolov7 and other versions of Yolo
Table 4.2: Comparison between Yolo and other CNNs
LIST OF ABBREVIATIONS
is essential to overcome the remaining limitations in small and medium enterprises in Vietnam.
Artificial Intelligence (AI) is a global technology trend, attracting investment from businesses applying it to business, production, and management processes. Any business that makes the most of the superior benefits of AI technology will certainly have a strong advantage in the growth race of the digital transformation era. One of AI's breakthroughs is Deep Learning, through which businesses can easily apply artificial intelligence to solve problems in life.
Due to the outstanding advantages of Deep Learning over current algorithms, our team decided to apply Deep Learning to a model for automatic parcel classification, in order to optimize the product classification process, reduce the rate of human error, reduce labor costs, and shorten delivery times in the field of logistics. That can also help us better understand how such systems operate.
• Building an AGV (Automatically Guided Vehicle) controlled by the AI system through a WIFI connection to transport parcels to each fixed compartment.
• In addition, building tracking software that displays the location of the autonomous vehicle, the location of the dispatch box, and the number of parcels in each dispatch box.
1.3 Limitation
The project has the following limitation:
• The topic mainly focuses on identifying and classifying parcels with Code 128 barcodes; recognition of Code 128 on packages is performed in good lighting conditions with a direct shooting angle.
• The parcels classified in this topic are small and light.
• The AGV vehicle system for product sorting has a low load capacity and is operated indoors, away from direct sunlight.
• The tracking software can only be used on computers.
1.4 Research Method
Analyze and evaluate the energy efficiency, processing speed, and performance on embedded systems of neural network models for barcode recognition.
Study the parameters of the neural network model, then design the network model and train the system to perform barcode recognition.
Analyze and evaluate the system's functions, then select the hardware for the AI system and the AGV (Automatically Guided Vehicle).
1.5 Object and Scope of the study
1.5.1 Object of the study
To make the problems easier to approach, the group studied the following subjects to better understand how the topic should be implemented:
• Nvidia Jetson Nano Developer Kit as the hardware for AI application deployment: The product is a small but powerful embedded computer that can run modern AI algorithms quickly, with a 64-bit ARM quad-core CPU, an onboard 128-core NVIDIA GPU, and 4 GB of LPDDR4 memory. It can run multiple neural networks in parallel and process several high-resolution sensors simultaneously.
• Neural Network:
Yolo (You Only Look Once) and OCR (Optical Character Recognition).
• MCU ESP32 as the control device for the AGV parcel-sorting vehicle system: The product is a WiFi transceiver kit based on the ESP32 WiFi SoC and the CP2102 communication chip. It is used for applications that need to connect, collect data, and control devices over WiFi, especially applications related to IoT.
• PyQt5 framework for building the monitoring and operating programs: Qt is a cross-platform application framework developed in the C++ programming language that is used to create desktop, embedded, and mobile apps. Linux, OS X, Windows, VxWorks, QNX, Android, iOS, BlackBerry, Sailfish OS, and many other platforms are supported. PyQt is the Python interface for the Qt library, a collection of control interface components (widgets, graphical control elements).
• Firebase Realtime Database for communication between the Jetson Nano, the ESP32, and the monitoring software via a WIFI connection: Firebase is a Google-owned platform that helps developers build web and mobile apps. It provides many useful tools and services for developing a quality application, which shortens development time and helps the app reach users sooner. The Firebase Realtime Database is a cloud-hosted NoSQL real-time database that lets you store and sync data; the data is stored as a JSON tree and is synchronized in real time across all connections.
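As an illustration of the JSON-tree layout mentioned above, the system's database might be organized along these lines (all keys and values here are hypothetical, chosen only to mirror the tracking data this project stores: vehicle location, compartment names, and parcel counts):

```json
{
  "vehicle": {
    "location": "compartment_B",
    "state": "delivering"
  },
  "compartments": {
    "compartment_A": { "name": "Zone A", "parcel_count": 3 },
    "compartment_B": { "name": "Zone B", "parcel_count": 1 }
  }
}
```

Because every client syncs against the same tree, the Jetson Nano, the ESP32, and the monitoring software can all read and write these paths without talking to each other directly.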
1.5.2 Scope of the study
The topic's scope is limited to its stated purpose. In this report, the team analyzes the advantages of barcode-based product classification using Deep Learning compared to traditional methods, in the form of a hardware design analysis. At the same time, the group implements a model that performs automatic parcel classification, including:
• An AI system that recognizes barcodes and classifies them into the corresponding compartments.
• An AGV vehicle system controlled by the AI system through a WIFI connection to deliver parcels to each respective compartment.
• Tracking and operating software that displays the location of the autonomous vehicle, the number of parcels at each receiving location, and the name of each receiving location.
1.6 Outline
In this report, the research team has tried to present the material logically so that readers can easily understand the knowledge, methods, and operation of the topic. The report is divided into five chapters as follows:
Chapter 1: Overview. This chapter presents the current research status and development trends of artificial intelligence, and raises the urgency of applying artificial intelligence to the classification of goods in the field of logistics. From there, an automatic parcel sorting system applying AI technology is proposed to solve the limitations of manual product classification. Finally, the group sets out the goal, subjects, and scope of research for this system.
Chapter 2: Background. This chapter covers the theories related to the topic, including the neural networks, electronic components, and software used in the system.
Chapter 3: Design and Implementation. This chapter presents the system model in detail, including the block diagram and the operating principle of the system. It then covers the system design: which modules, electronic components, and neural network model are selected to achieve the highest efficiency, and the connection diagram between those modules and components. Finally, based on the system design, the hardware and software are implemented and the operating procedure of the system is given.
Chapter 4: Results. This chapter presents the implementation results and evaluates them against the theory presented in Chapter 2.
Chapter 5: Conclusion and future work. This chapter summarizes what has been done, states the limitations, and evaluates the system so that solutions and new development directions can be proposed for the topic.
CHAPTER 2: BACKGROUND
In this chapter, we provide an overview of the technologies and methods employed in this field, including AI technology, the AGV vehicle system, barcodes, and others.
2.1 AI Technology
2.1.1 Overview of CNN
The introduction of CNN:
A Convolutional Neural Network (CNN) is a type of artificial neural network that is widely used in Deep Learning for image and object recognition and classification. CNNs are important in a variety of tasks such as image processing; computer vision tasks such as localization and segmentation; video analysis; recognizing obstacles in self-driving cars; and speech recognition in natural language processing. CNNs are very popular in Deep Learning because they play a significant role in these rapidly growing and emerging areas.
Figure 2.1: Architecture of CNN network [1]
Input layer: CNN is inspired by the ANN model, so its input is an image, which holds the raw pixel values.
Convolutional layer: determines the output of neurons connected to local regions of the input by calculating the scalar product between their weights and the region connected to the input volume.
Pooling layer: simply downscales the input along its spatial dimensions, significantly lowering the number of parameters in that activation.
Fully connected layer: then carries out the ANN's usual role and produces class scores from the activations for classification. Additionally, it is proposed that ReLU be applied between these layers to enhance performance. The rectified linear unit (ReLU) applies the elementwise activation function f(x) = max(0, x) to the output of the previous layer.
Next, we analyze the convolutional layer and the fully connected layer in more detail.
Convolutional layer: The convolutional layer is crucial to how CNNs work, as its name suggests. The layer's parameters center on the use of learnable kernels. These kernels often have a small spatial dimension yet cover the entire depth of the input. When the data enters the layer, each filter is convolved across the spatial dimensions of the input, creating a 2D activation map.
The input matrix is convolved with a matrix called the kernel to create a feature map for the following layer. We carry out the convolution by sliding the kernel matrix over the input matrix; at each position, we perform element-by-element multiplication and sum the results into the feature map.
Figure 2.2: This is an image of sliding kernel through input’s matrix [2]
Convolution can be applied over more than one axis. If we have a two-dimensional image input I and a two-dimensional kernel filter K, the convolved image is calculated as:
S(i, j) = (I * K)(i, j) = Σm Σn I(m, n) · K(i − m, j − n)
For example, if the network's input is an image of size 64x64x3 (an RGB image with a dimensionality of 64x64) and the receptive field size is set to 6x6, each neuron in the convolutional layer would have a total of 108 weights (6x6x3, where 3 is the magnitude of connectivity across the volume's depth). To put this in context, a standard neuron in other forms of ANN would have 12,288 weights.
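The sliding-kernel operation described above can be sketched in a few lines of NumPy. This is a minimal, unoptimized illustration (no padding, stride 1); real frameworks implement it far more efficiently, and strictly speaking it computes cross-correlation, which is what Deep Learning libraries call "convolution":

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and
    sum the element-wise products at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 4x4 input and a 2x2 kernel give a 3x3 feature map.
img = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, 1.0]])
fm = conv2d(img, k)
print(fm.shape)   # (3, 3)
print(fm[0, 0])   # 0*1 + 5*1 = 5.0
```

Each output element is the sum of one element-wise product between the kernel and the input patch under it, exactly as in Figure 2.2.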
Convolutional layers can also significantly reduce the model's complexity through output optimization, using three hyperparameters: depth, stride, and zero-padding.
The depth of the output volume produced by the convolutional layers can be set manually through the number of neurons in the layer connected to the same region of the input. This can be seen in other types of ANN, where all of the neurons in a hidden layer are directly connected to every single neuron in the previous layer. Reducing this hyperparameter significantly reduces the total number of neurons in the network, but it also significantly reduces the model's pattern recognition capabilities.
We can also define the stride, the step with which the receptive field is slid across the spatial dimensions of the input. For example, if we set the stride to 1, we will have heavily overlapping receptive fields and extremely large activations. Alternatively, increasing the stride reduces the amount of overlap and produces an output with lower spatial dimensions.
Zero-padding is the simple process of padding the input's border, and it is an effective way to gain more control over the dimensionality of the output volumes.
It is important to understand that by employing these techniques we change the spatial dimensionality of the convolutional layer's output, which can be calculated with the following formula:
(V − R + 2Z) / S + 1
where V denotes the input volume size (height, width, depth), R the receptive field size, Z the amount of zero-padding, and S the stride. If the result of this equation is not a whole integer, the stride has been set incorrectly, and the neurons will not fit neatly across the given input.
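A small helper makes the formula concrete and also catches the invalid-stride case described above (the function name is illustrative):

```python
def conv_output_size(v, r, z, s):
    """Spatial output size of a convolutional layer: (V - R + 2Z) / S + 1.
    Raises if the stride does not tile the input evenly."""
    num = v - r + 2 * z
    if num % s != 0:
        raise ValueError("stride does not fit neatly across the input")
    return num // s + 1

# The 64x64 input with a 6x6 receptive field, no padding, stride 1:
print(conv_output_size(64, 6, 0, 1))  # 59
```

This matches the earlier example: a 64x64 input convolved with 6x6 receptive fields at stride 1 yields a 59x59 activation map per filter.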
Despite these efforts, if we use an image input of any realistic dimensionality, our models will still be enormous. However, methods have been developed to greatly reduce the overall number of parameters within the convolutional layer.
The assumption behind parameter sharing is that if a feature is useful to compute in one spatial region, it is likely to be useful in another. If we constrain each individual activation map within the output volume to the same weights and bias, the number of parameters produced by the convolutional layer is drastically reduced.
As a result, during back-propagation, each neuron in the output represents the overall gradient, which can be totaled across the depth, updating only a single set of weights rather than all of them.
Pooling layer: The goal of pooling layers is to gradually reduce the dimensionality of the representation, and thus the number of parameters and the computational complexity of the model. The pooling layer runs over each activation map and scales its dimensionality using the "MAX" function. Most CNNs use max-pooling layers with 2x2 kernels applied with a stride of 2 along the spatial dimensions of the input. This reduces each activation map to 25% of its original size while keeping the depth volume unchanged. Because of the pooling layer's destructive nature, only a few configurations are commonly observed; typically, both the pooling layers' stride and filter size are set to 2.
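A minimal NumPy sketch of the 2x2, stride-2 max pooling described above (a simplified illustration, not a framework implementation):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: each output element is the
    maximum over a non-overlapping 2x2 window of the input."""
    h, w = x.shape
    # Reshape into non-overlapping 2x2 blocks and take the max per block.
    trimmed = x[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [4, 5, 6, 7]])
print(max_pool_2x2(a))
# [[4 8]
#  [9 7]]
```

The 4x4 map becomes 2x2, i.e. 25% of its original element count, which is exactly the reduction the text describes.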
Fully connected layer: The neurons in the fully connected layer are directly connected to the neurons in the two adjacent layers, but not to any neurons within the same layer.
2.1.2 Yolo network
The overview of Yolo
Yolo (You Only Look Once) is a CNN model for object detection, classification, and recognition. Yolo's architecture is built from convolutional layers followed by fully connected layers.
The architecture of Yolo
According to the author, the Yolo network is inspired by the GoogLeNet model for image classification. The network consists of 24 convolutional layers, followed by two fully connected layers. Instead of GoogLeNet's inception modules, Yolo simply uses 1x1 reduction layers followed by 3x3 convolutional layers. The entire network's architecture is shown below.
Figure 2.3: Architecture of Yolo network [3]
The author also trained a fast version of Yolo to push the limits of fast object detection. Fast Yolo employs a neural network with fewer convolutional layers (9 as opposed to 24) and fewer filters in those layers. Except for the network size, all training and testing parameters are the same for Yolo and Fast Yolo.
The final output is a 7x7x30 tensor.
Anchor box: YOLO needs anchor boxes as the basis of the estimation to find the bounding box of an object. These anchor boxes are predefined and closely surround the object. The anchor box is later refined by the bounding-box regression algorithm to create the predicted bounding box for the object. Each object in the training image is assigned to an anchor box in a YOLO model; if two or more anchor boxes surround the object, we choose the one with the highest IoU with the ground-truth bounding box.
Figure 2.4: Identify anchor box of an object [3]
Each object in the training image is assigned to the cell on the feature map that contains the object's midpoint. So, in order to identify an object, we must identify both components associated with it (cell and anchor box), not just the cell or the anchor box alone.
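The IoU criterion used above to pick the best anchor box can be sketched as follows. Boxes are given as (x1, y1, x2, y2) corner coordinates; this is a simplified illustration, not the YOLO implementation itself:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two 4x4 boxes overlapping in a 2x2 region: IoU = 4 / (16 + 16 - 4) = 1/7.
print(iou((0, 0, 4, 4), (2, 2, 6, 6)))
```

During training, this score is computed between each candidate anchor box and the ground-truth box, and the anchor with the highest value wins the assignment.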
Bounding box: Each grid cell predicts B bounding boxes, and every bounding box has five predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the grid cell bounds; the width and height are relative to the entire image. Finally, the confidence prediction represents the IoU between the predicted box and any ground-truth box.
In addition, each grid cell predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. Regardless of the number of boxes B, we predict only one set of class probabilities per grid cell. At test time, we multiply the conditional class probabilities by the individual box confidence predictions.
This provides class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
Figure 2.5: The active way of YOLO [3]
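The per-class, per-box scores described above can be sketched with NumPy broadcasting. The shapes follow the original YOLO's S=7 grid, B=2 boxes, and C=20 classes; the random values here merely stand in for real network outputs:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (original YOLO values)
rng = np.random.default_rng(0)
box_conf = rng.random((S, S, B))    # Pr(Object) * IoU for each predicted box
class_prob = rng.random((S, S, C))  # Pr(Class_i | Object), one set per cell

# Class-specific confidence for every box: broadcast the per-cell class
# probabilities across the B boxes of that cell and multiply.
scores = box_conf[..., :, None] * class_prob[..., None, :]
print(scores.shape)  # (7, 7, 2, 20)
```

Note that the class probabilities are shared by all B boxes of a cell, which is exactly why YOLO predicts only one set of class probabilities per grid cell.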
2.1.3 Yolov7
Yolov7's theory:
The YOLO (You Only Look Once) v7 model is the most recent addition to the YOLO family. YOLO models are single-stage object detectors. In a YOLO model, image frames are featurized by a backbone; these features are combined and mixed in the neck before being passed to the network's head, where YOLO predicts the locations and classes of the objects around which bounding boxes should be drawn.
Yolov7 outperforms all known object detectors in both speed and accuracy in the 5 FPS to 160 FPS range, and has the highest accuracy (56.8% AP) among all known real-time object detectors running at about 30 FPS on a V100 GPU.
The architecture of YOLOv7:
The architecture was created based on YOLOv4, Scaled-YOLOv4, and YOLOR. Using these models as a foundation, additional experiments were carried out to develop the new and improved YOLOv7.
YOLOv7 performs the same recognition task as previous YOLO versions, but it is faster and has a shorter inference time. In general, YOLOv7 is designed with more convolution layers than other versions, and its architecture also differs from the previous ones. When designing a network architecture, researchers commonly prioritize fundamental requirements such as keeping the number of parameters, the amount of computation, and the computational density lower than before.
In this version, the authors not only rely on these conditions but also consider the number of elements in the convolution layers' output tensors. On this basis, the authors created the CSPVoVNet network, inspired by the earlier VoVNet network.
E-ELAN (Extended Efficient Layer Aggregation Network): E-ELAN is the computational block of the YOLOv7 backbone. It is inspired by previous research on network efficiency and was designed by analyzing the following factors that influence speed and accuracy:
• Memory access cost
• The ratio of I/O channels
• Element-wise operations
• Activation
• The gradient path
Simply put, the E-ELAN architecture allows the framework to learn more effectively. It is built around the ELAN computational block. At the time of writing, the ELAN paper had not yet been published; when ELAN information becomes available, this section can be updated.
Figure 2.6: Architecture of E-ELAN [4]
Compound Model Scaling in YOLOv7: Different applications require different models. Some require highly accurate models, while others prioritize speed. Model scaling is performed to meet these requirements and to fit the model onto different computing devices.
The following parameters are taken into account when scaling a model size:
• Resolution (size of the input image)
• Width (number of channels)
• Depth (number of layers)
• Stage (number of feature pyramids)
A common model scaling method is NAS (Network Architecture Search). Researchers use it to iterate through the parameters to find the best scaling factors. However, methods such as NAS perform parameter-specific scaling, in which case the scaling factors are treated as unrelated.
The authors of the YOLOv7 paper demonstrate that scaling can be further optimized with a compound model scaling approach: for concatenation-based models, width and depth are scaled in coherence.
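As a rough illustration of this idea (a hypothetical helper, not the paper's exact formulation): when the depth of a concatenation-based block is scaled, its concatenated output width changes, so the transition-layer width is scaled by the corresponding factor rather than independently:

```python
def compound_scale(block_depth, branch_width, depth_gain, width_gain):
    # Scale the depth of a concatenation-based computational block.
    new_depth = max(1, round(block_depth * depth_gain))
    # Output channels of the block grow with depth (branches concatenated),
    # so the transition layer's width must follow the same change.
    concat_out = branch_width * new_depth
    new_width = max(1, round(concat_out * width_gain))
    return new_depth, new_width
```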
Figure 2.7: Architecture of Compound Model Scaling in YOLOv7 [4]
Trainable Bag of Freebies in YOLOv7:
Planned Re-parameterized Convolution
Re-parameterization techniques average a set of model weights to create a model that is more robust to the general patterns it is trying to model. Recent research has focused on module-level re-parameterization, where each component of the network has its own re-parameterization strategy.
The YOLOv7 authors use gradient flow propagation paths to determine which network modules should and should not use re-parameterization strategies.
Figure 2.8: Architecture of Planned Re-parameterized Convolution [4]
In the diagram above, the RepConv layer replaces the E-ELAN computational block's 3×3 convolution layer. The authors conducted experiments by switching or replacing the positions of RepConv, the 3×3 Conv, and the Identity connection (which is simply a 1×1 convolutional layer) to see which configurations work and which do not. More information about RepConv can be found in the RepVGG paper.
In addition to RepConv, YOLOv7 applies re-parameterization to Conv-BN (Convolution Batch Normalization), OREPA (Online Convolutional Re-parameterization), and YOLOR to achieve the best results.
Coarse for Auxiliary and Fine for Lead Loss
The YOLO network head makes the final network predictions, but because it is so far downstream in the network, it may be advantageous to add an auxiliary head somewhere in the middle. While training, both this auxiliary detection head and the head that actually makes predictions are supervised.
Because there is less network between the auxiliary head and the final prediction, it does not train as efficiently as the lead head, so the YOLOv7 authors experiment with different levels of supervision for this head, settling on a coarse-to-fine definition in which supervision is passed back from the lead head at different granularities.
Figure 2.9: Architecture of Coarse for Auxiliary and Fine for Lead Loss [4]
2.1.4 OCR Theory
The Overview of OCR:
The use of technology to distinguish printed or handwritten text characters within digital images of physical documents, such as scanned paper documents, is known as OCR (optical character recognition). The fundamental process of OCR is to examine the text of a document and translate the characters into a code that can be used for data processing. Text recognition is another term for optical character recognition (OCR).
The working principle of OCR:
OCR software first converts the scanned image into a two-color version, classifying light areas as background and dark areas as potential characters. The dark areas are then further processed to look for alphabetic letters or numeric digits. OCR programs use a variety of techniques, but most focus on one character, word, or block of text at a time. Following that, characters are identified using one of two approaches:
Recognition of patterns: OCR programs are fed examples of text in various fonts and formats, which are then compared against and recognized in the scanned document.
Detection of features: To recognize characters in a scanned document, OCR programs apply rules based on the features of a specific letter or number. Features can include the number of angled lines, crossed lines, or curves in a character. For example, the capital letter "A" can be represented by two diagonal lines intersected by a horizontal line in the middle.
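The first step of the process described above, separating dark character areas from the light background, can be sketched as a simple threshold (illustrative only; production OCR engines use adaptive thresholding):

```python
def binarize(gray_image, threshold=128):
    # Mark dark pixels (candidate character strokes) as 1, background as 0.
    return [[1 if pixel < threshold else 0 for pixel in row]
            for row in gray_image]
```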
2.1.5 Tesseract model
Text recognition is a difficult computer-vision task with many practical applications, and optical character recognition (OCR) enables a variety of automation applications. This project focuses on word detection and recognition in natural images, a problem that is significantly more difficult than reading text in scanned documents. Despite the limited availability of images, the use case in focus makes it possible to detect the text area in natural scenes with greater accuracy; this is accomplished by mounting a camera on a truck and continuously capturing similar images. The Tesseract OCR engine is then used to recognize the detected text area.
Line Finding:
The line-finding algorithm is designed to recognize a skewed page without having to de-skew it, preserving image quality. Blob filtering and line construction are the critical steps in the process. Assuming that page layout analysis has already provided text regions of roughly uniform text size, a simple percentile height filter removes drop-caps and vertically touching characters. Since the median height approximates the text size in the region, blobs much smaller than the median height can safely be filtered out as likely punctuation, diacritical marks, and noise.
Baseline Fitting:
The blobs are partitioned into groups with a reasonably continuous displacement from the original straight baseline in order to fit the baselines. A quadratic spline is fitted to the most populous partition (assumed to be the baseline) by a least-squares fit. The quadratic spline has the advantage of being reasonably stable for this calculation, but the disadvantage that discontinuities can arise when multiple spline segments are required; a more traditional cubic spline might be preferable.
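The least-squares quadratic fit can be sketched in pure Python as a simplified stand-in for Tesseract's actual spline fitting, fitting y = a + b·x + c·x² to blob baseline points:

```python
def fit_quadratic_baseline(points):
    """Least-squares fit of y = a + b*x + c*x**2 to (x, y) blob points."""
    # Normal equations (X^T X) beta = X^T y for the basis [1, x, x^2].
    sx = [sum(x ** k for x, _ in points) for k in range(5)]
    sy = [sum(y * x ** k for x, y in points) for k in range(3)]
    A = [[float(sx[i + j]) for j in range(3)] for i in range(3)]
    b = [float(v) for v in sy]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, 3):
            f = A[row][col] / A[col][col]
            for c in range(col, 3):
                A[row][c] -= f * A[col][c]
            b[row] -= f * b[col]
    beta = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):
        beta[row] = (b[row] - sum(A[row][c] * beta[c]
                                  for c in range(row + 1, 3))) / A[row][row]
    return beta  # [a, b, c]
```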
Chopping and Fixed Pitch Detection:
Tesseract examines the text lines to determine whether they are fixed pitch. Where it finds fixed-pitch text, Tesseract chops the words into characters using the pitch, and it disables the chopper and associator on these words for the word-recognition step.
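Chopping a fixed-pitch word into character cells is straightforward once the pitch is known; a minimal sketch (hypothetical pixel coordinates):

```python
def chop_fixed_pitch(word_left, word_right, pitch):
    # Return cut positions at every pitch interval inside the word's extent.
    cuts = []
    x = word_left + pitch
    while x < word_right:
        cuts.append(x)
        x += pitch
    return cuts
```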
2.2 Barcode Technology
2.2.1 Introduction to barcode
Nowadays, automation in production and management has become a leading trend, not only within individual countries but throughout the world. The use of automatic data acquisition (ADC) technology in general, and barcode technology in particular, has brought many obvious benefits in commerce and management. One of the most apparent benefits is that inventory, payment, and export management are carried out quickly and accurately. Barcodes are more widely used than other ADC technologies because of their economic advantages and high efficiency.
Norman Joseph Woodland and Bernard Silver developed the idea of barcode technology. In 1948, as students at Drexel University, they developed the idea after learning that a food-company president wanted a way to capture product information automatically at checkout. Barcodes have since been used to check consumer goods in retail businesses and the food industry all over the world.
2.2.2 Popular types of barcodes
UPC: UPC codes are commonly used in North American countries and Canada. They are used in fields that require numeric codes only, not alphanumeric codes, and they are reliable: UPC codes include a check digit for error checking.
Code 128: Code 128 is applied in the distribution of goods in logistics and transportation, the retail supply chain, and the manufacturing industry. Code 128 is highly appreciated and well known in its applications because of advantages such as a compact barcode and diverse information storage: it can encode uppercase and lowercase letters, numeric characters, standard ASCII characters, and control codes.
EAN: The EAN code has similarities to the UPC code and is commonly used in European countries. EAN codes are used in fields where only numeric codes are required, no alphanumeric codes are needed, and they are reliable, with a check digit for error checking.
Code 39: Code 39 is used by the Ministry of National Defense, the health sector, administrative agencies, and book publishing. This type of code overcomes the biggest drawback of the EAN and UPC barcodes above: its capacity is unlimited, and it can encode uppercase characters, natural numbers, and some special characters.
2.2.3 The methods of Barcode scanning
Currently, there are many barcode-reading devices, and each device uses a different identification method, though all of them can recognize barcodes. However, no method is perfect; each has its own advantages and disadvantages.
• Advantages: low cost
• Cons: this type can only scan barcodes on flat surfaces at close range, not barcodes on curved surfaces
Laser Scanner:
Figure 2.11: Laser Scanner
Laser scanners consist of a reader that emits a red laser and then uses a reflector to create a light trail that sweeps across the surface of the barcode; they do not use a light-collecting lens.
• Advantages: no light-collecting lens needed, very sensitive laser scanning, highly accurate results, can scan barcodes on curved surfaces, and long-range scanning capability
• Cons: the reading eye is not durable and may weaken after a period of use
Read barcodes with Camera Software:
Figure 2.12: Read barcodes with Camera Software
Cameras are of great interest and are mainly used for applications that run on smartphones and for jobs that must handle multiple barcodes simultaneously. A high-resolution autofocus camera captures the input images, which are then processed by pre-programmed software.
• Advantages: gives users a more intuitive view of how barcodes in images are processed and read; a sensitive camera with good focus and accuracy; can process multiple barcodes simultaneously; highly portable and suitable for small, compact mobile devices
• Cons: affected by ambient light; the camera's resolution and focus must be high and appropriate; poor ability to read barcodes on curved surfaces; and limited reading distance
2.2.4 Code 128
Introduction to Code 128
Code 128 is a high-density symbology that encodes alphanumeric information. It offers protection against errors by means of a checksum digit and byte-parity checking. This symbology has been used extensively in applications where a large amount of data must be represented in a compact area. Its design also facilitates double-density encoding of numeric data.
Figure 2.13: Code 128
The parts of a Code 128 barcode
A Code 128 barcode consists of an initial "quiet zone," one of three start codes, the data, a check character, a stop character, and a final quiet zone.
The quiet zone is the unmarked region surrounding the bars and spaces; it allows scanners to establish baseline values for the color and reflectance of the object being scanned. These values are used to determine on the fly what constitutes a "space" and what constitutes a "bar."
The start code is one of three codes indicating the start of a Code 128 barcode. The Code 128 standard defines three "character sets" or "character modes": the Start-A, Start-B, and Start-C codes specify which character set is used. The character set may be switched within a barcode to encode data more efficiently.
Data in Code 128 are encoded in bars and spaces. In the encoding table below, a "1" indicates a single-width bar and a "0" a single-width space; sequences of ones or zeros appear as thicker bars or spaces. The actual procedure for calculating the Code 128 check digit is as follows:
Step 1: Use the value of the first (start) character (103, 104, or 105) as the initial value of the running checksum.