MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
TRƯƠNG QUANG PHÚC, M.Eng PHẠM MINH QUÂN
NGUYỄN HOÀI PHƯƠNG UYÊN
DESIGN AND IMPLEMENTATION OF
CLASSIFICATION AND DELIVERY BASED
ON COMPUTER VISION
SKL 009697
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND
EDUCATION FACULTY FOR HIGH QUALITY TRAINING
Student ID: 18119053
Major: COMPUTER ENGINEERING TECHNOLOGY
Ho Chi Minh City, December 2022
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND
EDUCATION FACULTY FOR HIGH QUALITY TRAINING
GRADUATION PROJECT
DESIGN AND IMPLEMENTATION OF CLASSIFICATION AND DELIVERY BASED ON
COMPUTER VISION
PHẠM MINH QUÂN
Student ID: 18161031
NGUYỄN HOÀI PHƯƠNG UYÊN
Student ID: 18119053
Major: COMPUTER ENGINEERING TECHNOLOGY
Advisor: TRƯƠNG QUANG PHÚC, M.Eng
Ho Chi Minh City, December 2022
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, December 25, 2022
GRADUATION PROJECT ASSIGNMENT
Student name: Phạm Minh Quân Student ID: 18161031
Student name: Nguyễn Hoài Phương Uyên Student ID: 18119053
Major: Computer Engineering Technology Class: 18119CLA
Advisor: Trương Quang Phúc, M.Eng Phone number: _ Date of assignment:
_
Date of submission: _
1 Project title: Design and Implementation of classification and delivery based on Computer Vision
2 Initial materials provided by the advisor: _
3 Content of the project: _
4 Final product:
CHAIR OF THE PROGRAM
(Sign with full name)
ADVISOR
(Sign with full name)
Trương Quang Phúc
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, December 25, 2022
ADVISOR’S EVALUATION SHEET
Student name: Phạm Minh Quân Student ID: 18161031
Student name: Nguyễn Hoài Phương Uyên Student ID: 18119053
Major: Computer Engineering Technology
Project title: Design and Implementation of classification and delivery based on Computer Vision
Advisor: Trương Quang Phúc, M.Eng
EVALUATION
1 Content of the project:
2 Strengths:
3 Weaknesses:
4 Approval for oral defense? (Approved or denied)
Approved
5 Overall evaluation: (Excellent, Good, Fair, Poor)
Good
6 Mark: 9.0 (in words: .)
Ho Chi Minh City, December 25, 2022
ADVISOR
(Sign with full name)
Trương Quang Phúc
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, January 13, 2023
MODIFYING EXPLANATION OF THE GRADUATION PROJECT
MAJOR: COMPUTER ENGINEERING
1 Project title: Design and Implementation of classification and delivery based on Computer
Vision
2 Student name: Phạm Minh Quân Student ID: 18161031
Student name: Nguyễn Hoài Phương Uyên Student ID: 18119053
3 Advisor: Trương Quang Phúc, M.Eng
4 Defending Council: Council 2, Room: A3-404, 3rd January 2023
5 Modifying explanation of the graduation project:
No. | Council comments | Editing results
1 | Many figures in chapter 2 are reused from other sources without the related references. | The related references have been provided for the figures in chapter 2.
2 | The visual quality of many figures in chapter 3 is very low and hard to follow. | The figures in chapter 3 have been modified to improve their visual quality.
3 | In the conclusion, the authors should clearly point out which objectives have been accomplished instead of giving a general summarization. | The conclusion section now clearly points out which objectives have been accomplished.
4 | The flowchart of figure 3.6 must have a "Begin" point and an "End" point in a terminator shape. | The flowchart of figure 3.6 has been modified accordingly.
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, December 25, 2022
PRE-DEFENSE EVALUATION SHEET
Student name: Phạm Minh Quân Student ID: 18161031
Student name: Nguyễn Hoài Phương Uyên Student ID: 18119053
Major: Computer Engineering Technology
Project title:
Name of Reviewer:
EVALUATION
1 Content and workload of the project
2 Strengths:
3 Weaknesses:
4 Approval for oral defense? (Approved or denied)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
EVALUATION SHEET OF DEFENSE COMMITTEE MEMBER
Student name: Phạm Minh Quân Student ID: 18161031
Student name: Nguyễn Hoài Phương Uyên Student ID: 18119053
Major: Computer Engineering Technology
Project title:
Name of Defense Committee Member:
EVALUATION
1 Content and workload of the project
2 Strengths:
3 Weaknesses:
4 Overall evaluation: (Excellent, Good, Fair, Poor)
SUPERVISOR APPROVAL
ACKNOWLEDGEMENTS
Over the course of this project, our group received plenty of valuable support that encouraged us to overcome the problems and challenges and to complete this hard but meaningful project.
Firstly, we would like to thank the School Board of the Ho Chi Minh City University of Technology and Education and the Faculty for High Quality Training for creating wonderful conditions for us to carry out our project.
Secondly, we sincerely thank Mr. Trương Quang Phúc, our advisor, who gave us useful guidance and instruction that helped us finish the project successfully. From his advice we were able to improve the project's content and correct our mistakes.
Thirdly, we are grateful to our classmates of class 18119CLA, who contributed advice and warm guidance whenever we needed support.
Last but not least, due to limited knowledge and implementation time, we cannot avoid errors; we look forward to receiving your comments and suggestions to improve this topic.
In short, we truly thank all the people who are part of our achievement.
Ho Chi Minh City, December 23, 2022
Students: Pham Minh Quan, Nguyen Hoai Phuong Uyen
ABSTRACT
Currently, industry both in Vietnam and internationally is growing quickly, and manufacturers are interested in combining industry with automation. With the advancement of digital technology, automatic lines are becoming more widely used in manufacturing. Manufacturing companies are constantly improving their technology and machinery systems in order to produce high-quality products at the most competitive prices. That is the foundation for improving their competitive position and helping businesses stand firm in a competitive market. This report delves deeper into the role of automation in modern manufacturing.
Automation benefits the logistics industry in particular, providing increased productivity, lower operating costs, improved product quality, and lower raw material costs.
After absorbing information and researching the automation industry, our team decided to implement the topic "Automatic parcel classification and delivery through image processing". We use a convolutional neural network (CNN), one of the most widely used networks in the field, for character recognition and barcode recognition. We chose a convolutional neural network because its deep architecture and large number of parameters are good enough for object recognition. We also prepared a sizable data set, including some particular examples, so that the training procedure yields good results. To optimize data inference and training time, we additionally chose NVIDIA's Jetson Nano hardware to exploit the GPU's processing capability. JPG files with the collected data are used for both testing and training.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
CHAPTER 1: OVERVIEW
1.1 Introduction
1.2 Objective
1.3 Limitation
1.4 Research Method
1.5 Object and Scope of the Study
1.5.1 Object of the Study
1.5.2 Scope of the Study
1.6 Outline
CHAPTER 2: BACKGROUND
2.1 AI Technology
2.1.1 Overview of CNN
2.1.2 Yolo Network
2.1.3 Yolov7
2.1.4 OCR Theory
2.1.5 Tesseract Model
2.2 Barcode Technology
2.2.1 Introduction to Barcodes
2.2.2 Barcode Types
2.2.3 The Methods of Barcode Scanning
2.2.4 Code 128
2.3 The Overview of AGV
2.3.1 The Introduction of AGV
2.3.2 The Fundamental Architecture of an AGV System
2.4 PyQt5 Platform
2.5 Firebase
2.5.1 Introduction to Firebase
2.5.2 Some Features of Firebase
2.5.3 The Pros and Cons of Firebase
2.6 Other Techniques Used in the Project
2.6.1 The Working Principle of the Infrared Sensor Circuit in the Vehicle's Line Detector
2.6.2 Pulse Width Modulation (PWM)
2.6.3 General Operating Principles of the Automatic Traction Robot
2.6.4 The Method for Establishing the Robot's Location in Relation to the Line
2.6.5 Serial Peripheral Interface (SPI)
CHAPTER 3: DESIGN AND IMPLEMENTATION
3.1 System Requirements
3.2 Block Diagram
3.3 AI System
3.3.1 Hardware Design
3.3.2 Detailed Software Design
3.4 AGV System
3.4.1 Mechanical Design
3.4.2 Detailed Hardware Design
3.4.3 The Schematic Diagram of the AGV System
3.4.4 Software Design
3.5 User Interface of the Delivery Application
3.6 Firebase Realtime Database
CHAPTER 4: RESULTS
4.1 Introduction
4.2 Hardware Implementation
4.3 System Operation
4.4 Software System
4.4.1 The Barcode Generation Process
4.4.2 Result of the Labelling Process
4.4.3 Annotations for Data
4.4.4 Training Process
4.4.5 The Text Detection Process with the Tesseract OCR Model
4.4.6 Interface of the Delivery App
4.5 Evaluation and Comparison
4.5.1 Comparison between Yolov7 and Other Yolo Versions
4.5.2 Comparison between Yolo and Other CNNs
CHAPTER 5: CONCLUSION AND FUTURE WORK
5.1 Conclusion
5.2 Future Work
APPENDIX
REFERENCES
LIST OF FIGURES
Figure 2.1: Architecture of CNN network
Figure 2.2: Sliding the kernel through the input matrix
Figure 2.3: Architecture of Yolo network
Figure 2.4: Identify anchor box of an object
Figure 2.5: The active way of YOLO
Figure 2.6: Architecture of E-ELAN
Figure 2.7: Architecture of Compound Model Scaling in YOLOv7
Figure 2.8: Architecture of Planned Re-parameterized Convolution
Figure 2.9: Architecture of Coarse for Auxiliary and Fine for Lead Loss
Figure 2.10: CCD scanner
Figure 2.11: Laser scanner
Figure 2.12: Reading barcodes with camera software
Figure 2.13: Code 128
Figure 2.14: The parts of a Code 128 barcode
Figure 2.15: AGV vehicle operation diagram
Figure 2.16: The basic structure of an AGV system
Figure 2.17: Towing type
Figure 2.18: Cargo type
Figure 2.19: Forklift
Figure 2.20: Interface of Qt Designer software
Figure 2.21: Image of QMainWindow
Figure 2.22: Firebase database
Figure 2.23: Working principle
Figure 2.24: Principle diagram of infrared sensor
Figure 2.25: Time diagram of the PWM pulse
Figure 2.26: General structure of line detection robot
Figure 2.27: The robot is in the middle of the line
Figure 2.28: The robot is moving to the right, level 1
Figure 2.29: The robot is moving to the right, level 2
Figure 2.30: Robot turns left
Figure 2.31: Sensor deviating from line
Figure 2.32: Communication between 1 master and 1 slave
Figure 2.33: Independent mode in SPI protocol
Figure 2.34: Daisy-chain mode in SPI protocol
Figure 2.35: SPI protocol operation modes
Figure 2.36: The communication process between master and slave using the SPI protocol
Figure 3.1: Block diagram of automatic classification and transportation system
Figure 3.2: The detailed block diagram of the AI system
Figure 3.3: Top view of Jetson Nano
Figure 3.4: Top view of Logitech C310 HD Webcam
Figure 3.5: The pipeline of the AI system
Figure 3.6: Flowchart of generating a barcode
Figure 3.7: Pipeline of pre-processing images
Figure 3.8: Format to export dataset
Figure 3.9: Pipeline of training data using the Yolov7 model
Figure 3.10: Command line to train the dataset on Google Colab
Figure 3.11: Command line to test an image from the dataset
Figure 3.12: Pipeline of Tesseract processing
Figure 3.13: The flowchart of detecting barcodes
Figure 3.14: Top view of a V1 reducer wheel
Figure 3.15: Coordinate system of the AGV robot
Figure 3.16: Model of forces acting on the wheel
Figure 3.17: Dual Shaft Plastic Geared TT Motor
Figure 3.18: Calculation model and force analysis when the vehicle is cornering
Figure 3.19: The 3D design of the AGV robot
Figure 3.20: The block diagram of the AGV system
Figure 3.21: The top view of ESP32
Figure 3.22: RFID MFRC522 module
Figure 3.23: Line Detection and Obstacle Avoidance Sensor 5 LED BFD-1000
Figure 3.24: Top view of the SG90 servo motor
Figure 3.25: The detailed schematic for the buzzer block
Figure 3.26: Top view of the L298N motor driver module
Figure 3.27: Top view of V1 geared DC motor
Figure 3.28: Top view of Lithium-ion 18650 battery
Figure 3.29: Top view of LM2586HVS 3A DC-DC step-down buck converter
Figure 3.30: The schematic of the power supply block
Figure 3.31: The schematic diagram of the AGV system
Figure 3.32: The operating diagram of the AGV system
Figure 3.33: The flowchart of the main program
Figure 3.34: The operating diagram of the delivery application
Figure 3.35: The database of the system
Figure 4.1: An automatic parcel classification and delivery system model
Figure 4.2: AI system model
Figure 4.3: Model of the AGV vehicle system
Figure 4.4: Line structure
Figure 4.5: The shape of the cargo block
Figure 4.6: AI system successfully recognizes and predicts barcodes
Figure 4.7: AGV vehicle at the delivery location
Figure 4.8: AGV vehicle returns to the starting position
Figure 4.9: Result of generating the barcode folder
Figure 4.10: Result of each barcode in Code 128 format
Figure 4.11: Result of labelling each image
Figure 4.12: Result of the annotation process
Figure 4.13: Successful Yolov7 dataset training results
Figure 4.14: The model's performance when tested with the input image
Figure 4.15: Accuracy of the input image after the testing process
Figure 4.16: The interface of the delivery app
Figure 4.17: Comparison of mAP and FPS between Yolov7 and Yolov5/Yolov6 on CPU
Figure 4.18: Comparison of mAP and FPS between Yolov7 and Yolov5/Yolov6 on GPU
Figure 4.19: Comparison of mAP and FPS between Yolov7 and Yolov5/Yolov6 on a TESLA P100
Figure 4.20: Comparison of AP and inference time between Yolov7 models and other Yolo versions' models
LIST OF TABLES
Table 2.1: The types of 1D barcodes used in industry
Table 2.2: The pros and cons of Firebase
Table 3.1: Specifications of the Dual Shaft Plastic Geared TT Motor
Table 3.2: Power consumption estimate of the AGV system
Table 4.1: Comparison of mAP and FPS between Yolov7 and other versions of Yolo
Table 4.2: Comparison between Yolo and other CNNs
LIST OF ABBREVIATIONS
is essential to overcome the remaining limitations in small and medium enterprises in Vietnam.
Artificial Intelligence (AI) is a global technology trend, attracting investment from businesses applying it to business, production, and management processes. Any business that makes the most of the superior benefits of AI technology will certainly have a strong advantage in the growth race of the digital transformation era. One of AI's breakthroughs is Deep Learning, through which businesses can easily apply artificial intelligence to solve problems in life.
Due to the outstanding advantages of Deep Learning over current algorithms, our team decided to apply Deep Learning to a model for automatic parcel classification, in order to optimize the product classification process, reduce the rate of human error, reduce labor costs, and shorten delivery times in the field of logistics. That can also help us better understand how such systems operate.
• Building an AGV (Automatically Guided Vehicle) controlled by the AI system through a WIFI connection to transport parcels to each fixed compartment.
• In addition, building tracking software that displays the location of the autonomous vehicle, the location of the dispatch box, and the number of parcels in each dispatch box.
1.3 Limitation
The project has the following limitation:
• The topic mainly focuses on identifying and classifying parcels with Code 128 barcodes; recognition of Code 128 on packages is performed in good lighting conditions with a direct shooting angle.
• The parcels classified in this topic are small and light.
• The AGV vehicle system for product sorting has a low load capacity and is operated indoors, away from direct sunlight.
• The tracking software can only be used on computers.
1.4 Research Method
Analyze and evaluate the energy efficiency, processing speed, and performance on embedded systems of neural network models for barcode recognition.
Study the parameters of the neural network model, then design the network model and train the system to perform barcode recognition.
Analyze and evaluate the system's functions, then select the hardware for the AI system and the AGV (Automatically Guided Vehicle).
1.5 Object and Scope of the study
1.5.1 Object of the study
To make the problems easier to approach, the group studied the following subjects to better understand how the topic should be implemented:
• Nvidia Jetson Nano Developer Kit as the hardware for AI application deployment: The product is a small but powerful embedded computer that can run modern AI algorithms quickly, with a 64-bit ARM quad-core CPU, an onboard 128-core NVIDIA GPU, and 4 GB of LPDDR4 memory. It can run multiple neural networks in parallel and process several high-resolution sensors simultaneously.
• Neural Network:
Yolo (You Only Look Once) and OCR (Optical Character Recognition).
• MCU ESP32 as the control device for the AGV parcel-sorting vehicle system: The product is a WiFi transceiver kit based on the ESP32 WiFi SoC and the CP2102 communication chip. It is used for applications that need to connect, collect data, and control devices over WiFi, especially applications related to IoT.
• PyQt5 framework for building the monitoring and operating programs: Qt is a cross-platform application framework developed in the C++ programming language that is used to create desktop, embedded, and mobile apps. Linux, OS X, Windows, VxWorks, QNX, Android, iOS, BlackBerry, Sailfish OS, and many other platforms are supported. PyQt is the Python interface for the Qt library, a collection of control interface components (widgets, graphical control elements).
• Firebase Realtime Database for communication between the Jetson Nano, the ESP32, and the monitoring software via a WIFI connection: Firebase is a Google-owned platform that helps developers build web and mobile apps. It provides many useful tools and services for developing a quality application, which shortens development time and helps the app reach users sooner. The Firebase Realtime Database is a cloud-hosted NoSQL real-time database that lets you store and sync data; the data is stored as a JSON tree and is synchronized in real time across all connections.
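As an illustration of the JSON-tree layout mentioned above, the system's database might be organized along these lines (all keys and values here are hypothetical, chosen only to mirror the tracking data this project stores: vehicle location, compartment names, and parcel counts):

```json
{
  "vehicle": {
    "location": "compartment_B",
    "state": "delivering"
  },
  "compartments": {
    "compartment_A": { "name": "Zone A", "parcel_count": 3 },
    "compartment_B": { "name": "Zone B", "parcel_count": 1 }
  }
}
```

Because every client syncs against the same tree, the Jetson Nano, the ESP32, and the monitoring software can all read and write these paths without talking to each other directly.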
1.5.2 Scope of the study
The topic's scope is limited to its stated purpose. In this report, the team analyzes the advantages of barcode-based product classification using Deep Learning compared to traditional methods, in the form of a hardware design analysis. At the same time, the group implements a model that performs automatic parcel classification, including:
• An AI system that recognizes barcodes and classifies them into the corresponding compartments.
• An AGV vehicle system controlled by the AI system through a WIFI connection to deliver parcels to each respective compartment.
• Tracking and operating software that displays the location of the autonomous vehicle, the number of parcels at each receiving location, and the name of each receiving location.
1.6 Outline
In this report, the research team has tried to present the material logically so that readers can easily understand the knowledge, methods, and operation of the topic. The report is divided into five chapters as follows:
Chapter 1: Overview. This chapter presents the current research status and development trends of artificial intelligence, and raises the urgency of applying artificial intelligence to the classification of goods in the field of logistics. From there, an automatic parcel sorting system applying AI technology is proposed to solve the limitations of manual product classification. Finally, the group sets out the goal, subjects, and scope of research for this system.
Chapter 2: Background. This chapter covers the theories related to the topic, including the neural networks, electronic components, and software used in the system.
Chapter 3: Design and Implementation. This chapter presents the system model in detail, including the block diagram and the operating principle of the system. It then covers the system design: which modules, electronic components, and neural network model are selected to achieve the highest efficiency, and the connection diagram between those modules and components. Finally, based on the system design, the hardware and software are implemented and the operating procedure of the system is given.
Chapter 4: Results. This chapter presents the implementation results and evaluates them against the theory presented in Chapter 2.
Chapter 5: Conclusion and future work. This chapter summarizes what has been done, states the limitations, and evaluates the system so that solutions and new development directions can be proposed for the topic.
CHAPTER 2: BACKGROUND
In this chapter, we provide an overview of the technologies and methods employed in this field, including AI technology, the AGV vehicle system, barcodes, and others.
2.1 AI Technology
2.1.1 Overview of CNN
The introduction of CNN:
A Convolutional Neural Network (CNN) is a type of artificial neural network that is widely used in Deep Learning for image and object recognition and classification. CNNs are important in a variety of tasks such as image processing; computer vision tasks such as localization and segmentation; video analysis; recognizing obstacles in self-driving cars; and speech recognition in natural language processing. CNNs are very popular in Deep Learning because they play a significant role in these rapidly growing and emerging areas.
Figure 2.1: Architecture of CNN network [1]
Input layer: CNN is inspired by the ANN model, so its input is an image, which holds the raw pixel values.
Convolutional layer: determines the output of neurons connected to local regions of the input by calculating the scalar product between their weights and the region connected to the input volume.
Pooling layer: simply downscales the input along its spatial dimensions, significantly lowering the number of parameters in that activation.
Fully connected layer: then carries out the ANN's usual role and produces class scores from the activations for classification. Additionally, it is proposed that ReLU be applied between these layers to enhance performance. The rectified linear unit (ReLU) applies the elementwise activation function f(x) = max(0, x) to the output of the previous layer.
Next, we analyze the convolutional layer and the fully connected layer in more detail.
Convolutional layer: The convolutional layer is crucial to how CNNs work, as its name suggests. The layer's parameters center on the use of learnable kernels. These kernels often have a small spatial dimension yet cover the entire depth of the input. When the data enters the layer, each filter is convolved across the spatial dimensions of the input, creating a 2D activation map.
The input matrix is convolved with a matrix called the kernel to create a feature map for the following layer. We carry out the convolution by sliding the kernel matrix over the input matrix; at each position, we perform element-by-element multiplication and sum the results into the feature map.
Figure 2.2: This is an image of sliding kernel through input’s matrix [2]
Convolution can be applied over more than one axis. If we have a two-dimensional image input I and a two-dimensional kernel filter K, the convolved image is calculated as:
S(i, j) = (I * K)(i, j) = Σm Σn I(m, n) · K(i − m, j − n)
For example, if the network's input is an image of size 64x64x3 (an RGB image with a dimensionality of 64x64) and the receptive field size is set to 6x6, each neuron in the convolutional layer would have a total of 108 weights (6x6x3, where 3 is the magnitude of connectivity across the volume's depth). To put this in context, a standard neuron in other forms of ANN would have 12,288 weights.
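The sliding-kernel operation described above can be sketched in a few lines of NumPy. This is a minimal, unoptimized illustration (no padding, stride 1); real frameworks implement it far more efficiently, and strictly speaking it computes cross-correlation, which is what Deep Learning libraries call "convolution":

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and
    sum the element-wise products at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 4x4 input and a 2x2 kernel give a 3x3 feature map.
img = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, 0.0], [0.0, 1.0]])
fm = conv2d(img, k)
print(fm.shape)   # (3, 3)
print(fm[0, 0])   # 0*1 + 5*1 = 5.0
```

Each output element is the sum of one element-wise product between the kernel and the input patch under it, exactly as in Figure 2.2.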
Convolutional layers can also significantly reduce the model's complexity through output optimization, using three hyperparameters: depth, stride, and zero-padding.
The depth of the output volume produced by the convolutional layers can be set manually through the number of neurons in the layer connected to the same region of the input. This can be seen in other types of ANN, where all of the neurons in a hidden layer are directly connected to every single neuron in the previous layer. Reducing this hyperparameter significantly reduces the total number of neurons in the network, but it also significantly reduces the model's pattern recognition capabilities.
We can also define the stride, the step with which the receptive field is slid across the spatial dimensions of the input. For example, if we set the stride to 1, we will have heavily overlapping receptive fields and extremely large activations. Alternatively, increasing the stride reduces the amount of overlap and produces an output with lower spatial dimensions.
Zero-padding is the simple process of padding the input's border, and it is an effective way to gain more control over the dimensionality of the output volumes.
It is important to understand that by employing these techniques we change the spatial dimensionality of the convolutional layer's output, which can be calculated with the following formula:
(V − R + 2Z) / S + 1
where V denotes the input volume size (height, width, depth), R the receptive field size, Z the amount of zero-padding, and S the stride. If the result of this equation is not a whole integer, the stride has been set incorrectly, and the neurons will not fit neatly across the given input.
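A small helper makes the formula concrete and also catches the invalid-stride case described above (the function name is illustrative):

```python
def conv_output_size(v, r, z, s):
    """Spatial output size of a convolutional layer: (V - R + 2Z) / S + 1.
    Raises if the stride does not tile the input evenly."""
    num = v - r + 2 * z
    if num % s != 0:
        raise ValueError("stride does not fit neatly across the input")
    return num // s + 1

# The 64x64 input with a 6x6 receptive field, no padding, stride 1:
print(conv_output_size(64, 6, 0, 1))  # 59
```

This matches the earlier example: a 64x64 input convolved with 6x6 receptive fields at stride 1 yields a 59x59 activation map per filter.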
Despite these efforts, if we use an image input of any realistic dimensionality, our models will still be enormous. However, methods have been developed to greatly reduce the overall number of parameters within the convolutional layer.
The assumption behind parameter sharing is that if a feature is useful to compute in one spatial region, it is likely to be useful in another. If we constrain each individual activation map within the output volume to the same weights and bias, the number of parameters produced by the convolutional layer is drastically reduced.
As a result, during back-propagation, each neuron in the output represents the overall gradient, which can be totaled across the depth, updating only a single set of weights rather than all of them.
Pooling layer: The goal of pooling layers is to gradually reduce the dimensionality of the representation, and thus the number of parameters and the computational complexity of the model. The pooling layer runs over each activation map and scales its dimensionality using the "MAX" function. Most CNNs use max-pooling layers with 2x2 kernels applied with a stride of 2 along the spatial dimensions of the input. This reduces each activation map to 25% of its original size while keeping the depth volume unchanged. Because of the pooling layer's destructive nature, only a few configurations are commonly observed; typically, both the pooling layers' stride and filter size are set to 2.
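A minimal NumPy sketch of the 2x2, stride-2 max pooling described above (a simplified illustration, not a framework implementation):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: each output element is the
    maximum over a non-overlapping 2x2 window of the input."""
    h, w = x.shape
    # Reshape into non-overlapping 2x2 blocks and take the max per block.
    trimmed = x[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [4, 5, 6, 7]])
print(max_pool_2x2(a))
# [[4 8]
#  [9 7]]
```

The 4x4 map becomes 2x2, i.e. 25% of its original element count, which is exactly the reduction the text describes.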
Fully connected layer: The neurons in the fully connected layer are directly connected to the neurons in the two adjacent layers, but not to any neurons within the same layer.
2.1.2 Yolo network
The overview of Yolo
Yolo (You Only Look Once) is a CNN model for object detection, classification, and recognition. Yolo's architecture is built from convolutional layers followed by fully connected layers.
The architecture of Yolo
According to the author, the Yolo network is inspired by the GoogLeNet model for image classification. The network consists of 24 convolutional layers, followed by two fully connected layers. Instead of GoogLeNet's inception modules, Yolo simply uses 1x1 reduction layers followed by 3x3 convolutional layers. The entire network's architecture is shown below.
Figure 2.3: Architecture of Yolo network [3]
The author also trained a fast version of Yolo to push the limits of fast object detection. Fast Yolo employs a neural network with fewer convolutional layers (9 as opposed to 24) and fewer filters in those layers. Except for the network size, all training and testing parameters are the same for Yolo and Fast Yolo.
The final output is a 7x7x30 tensor.
Anchor box: YOLO needs anchor boxes as the basis of the estimation to find the bounding box of an object. These anchor boxes are predefined and closely surround the object. The anchor box is later refined by the bounding-box regression algorithm to create the predicted bounding box for the object. Each object in the training image is assigned to an anchor box in a YOLO model; if two or more anchor boxes surround the object, we choose the one with the highest IoU with the ground-truth bounding box.
Figure 2.4: Identify anchor box of an object [3]
Each object in the training image is assigned to the cell on the feature map that contains the object's midpoint. So, in order to identify an object, we must identify both components associated with it (cell and anchor box), not just the cell or the anchor box alone.
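The IoU criterion used above to pick the best anchor box can be sketched as follows. Boxes are given as (x1, y1, x2, y2) corner coordinates; this is a simplified illustration, not the YOLO implementation itself:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two 4x4 boxes overlapping in a 2x2 region: IoU = 4 / (16 + 16 - 4) = 1/7.
print(iou((0, 0, 4, 4), (2, 2, 6, 6)))
```

During training, this score is computed between each candidate anchor box and the ground-truth box, and the anchor with the highest value wins the assignment.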
Bounding box: Each grid cell predicts B bounding boxes, and every bounding box has five predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the grid cell bounds; the width and height are relative to the entire image. Finally, the confidence prediction represents the IoU between the predicted box and any ground-truth box.
In addition, each grid cell predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. Regardless of the number of boxes B, we predict only one set of class probabilities per grid cell. At test time, we multiply the conditional class probabilities by the individual box confidence predictions.
This provides class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
Figure 2.5: The active way of YOLO [3]
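The per-class, per-box scores described above can be sketched with NumPy broadcasting. The shapes follow the original YOLO's S=7 grid, B=2 boxes, and C=20 classes; the random values here merely stand in for real network outputs:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (original YOLO values)
rng = np.random.default_rng(0)
box_conf = rng.random((S, S, B))    # Pr(Object) * IoU for each predicted box
class_prob = rng.random((S, S, C))  # Pr(Class_i | Object), one set per cell

# Class-specific confidence for every box: broadcast the per-cell class
# probabilities across the B boxes of that cell and multiply.
scores = box_conf[..., :, None] * class_prob[..., None, :]
print(scores.shape)  # (7, 7, 2, 20)
```

Note that the class probabilities are shared by all B boxes of a cell, which is exactly why YOLO predicts only one set of class probabilities per grid cell.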
2.1.3 Yolov7
Yolov7's theory:
The YOLO (You Only Look Once) v7 model is the most recent addition to the YOLO family. YOLO models are single-stage object detectors. In a YOLO model, image frames are featurized by a backbone; these features are combined and mixed in the neck before being passed to the network's head, where YOLO predicts the locations and classes of the objects around which bounding boxes should be drawn.
Yolov7 outperforms all known object detectors in both speed and accuracy in the 5 FPS to 160 FPS range, and has the highest accuracy (56.8% AP) among all known real-time object detectors running at about 30 FPS on a V100 GPU.
The architecture of YOLOv7:
The architecture was created based on YOLOv4, Scaled-YOLOv4, and YOLOR. Using these models as a foundation, additional experiments were carried out to develop the new and improved YOLOv7.
YOLOv7 performs the same recognition task as previous YOLO versions, but it is faster and has a shorter inference time. In general, YOLOv7 is designed with more convolution layers than other versions, and its architecture also differs from the previous ones. When designing a network architecture, researchers commonly prioritize fundamental requirements such as keeping the number of parameters, the amount of computation, and the computational density lower than before.
In this version, the authors not only rely on these conditions but also consider the number of elements in the convolution layers' output tensors. On this basis, the authors created the CSPVoVNet network, inspired by the earlier VoVNet network.
E-ELAN (Extended Efficient Layer Aggregation Network): E-ELAN is the computational block of the YOLOv7 backbone. It is inspired by previous research on network efficiency and was designed by analyzing the following factors that influence speed and accuracy:
• Memory access cost
• The ratio of I/O channels
• Element-wise operations
• Activation
• The gradient path
Simply put, the E-ELAN architecture allows the framework to learn more effectively. It is built around the ELAN computational block. At the time of writing, the ELAN paper had not yet been published; when ELAN information becomes available, this section can be updated.
Figure 2.6: Architecture of E-ELAN [4]
Compound Model Scaling in YOLOv7: Different applications require different models. Some require highly accurate models, while others prioritize speed. Model scaling is performed to meet these requirements and to fit the model onto different computing devices.
The following parameters are taken into account when scaling a model size:
• Resolution (size of the input image)
• Width (number of channels)
• Depth (number of layers)
• Stage (number of feature pyramids)
A common model scaling method is NAS (Network Architecture Search). Researchers use it to iterate through the parameters to find the best scaling factors. However, methods such as NAS perform parameter-specific scaling, in which case the scaling factors are treated as unrelated.
The authors of the YOLOv7 paper demonstrate that scaling can be further optimized with a compound model scaling approach: for concatenation-based models, width and depth are scaled in coherence.
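As a rough illustration of this idea (a hypothetical helper, not the paper's exact formulation): when the depth of a concatenation-based block is scaled, its concatenated output width changes, so the transition-layer width is scaled by the corresponding factor rather than independently:

```python
def compound_scale(block_depth, branch_width, depth_gain, width_gain):
    # Scale the depth of a concatenation-based computational block.
    new_depth = max(1, round(block_depth * depth_gain))
    # Output channels of the block grow with depth (branches concatenated),
    # so the transition layer's width must follow the same change.
    concat_out = branch_width * new_depth
    new_width = max(1, round(concat_out * width_gain))
    return new_depth, new_width
```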
Figure 2.7: Architecture of Compound Model Scaling in YOLOv7 [4]
Trainable Bag of Freebies in YOLOv7:
Planned Re-parameterized Convolution
Re-parameterization techniques average a set of model weights to create a model that is more robust to the general patterns it is trying to model. Recent research has focused on module-level re-parameterization, where each component of the network has its own re-parameterization strategy.
The YOLOv7 authors use gradient flow propagation paths to determine which network modules should and should not use re-parameterization strategies.
Figure 2.8: Architecture of Planned Re-parameterized Convolution [4]
In the diagram above, the RepConv layer replaces the E-ELAN computational block's 3×3 convolution layer. The authors conducted experiments by switching or replacing the positions of RepConv, the 3×3 Conv, and the Identity connection (which is simply a 1×1 convolutional layer) to see which configurations work and which do not. More information about RepConv can be found in the RepVGG paper.
In addition to RepConv, YOLOv7 applies re-parameterization to Conv-BN (Convolution Batch Normalization), OREPA (Online Convolutional Re-parameterization), and YOLOR to achieve the best results.
Coarse for Auxiliary and Fine for Lead Loss
The YOLO network head makes the final network predictions, but because it is so far downstream in the network, it may be advantageous to add an auxiliary head somewhere in the middle. While training, both this auxiliary detection head and the head that actually makes predictions are supervised.
Because there is less network between the auxiliary head and the final prediction, it does not train as efficiently as the lead head, so the YOLOv7 authors experiment with different levels of supervision for this head, settling on a coarse-to-fine definition in which supervision is passed back from the lead head at different granularities.
Figure 2.9: Architecture of Coarse for Auxiliary and Fine for Lead Loss [4]
2.1.4 OCR Theory
The Overview of OCR:
The use of technology to distinguish printed or handwritten text characters within digital images of physical documents, such as scanned paper documents, is known as OCR (optical character recognition). The fundamental process of OCR is to examine the text of a document and translate the characters into a code that can be used for data processing. Text recognition is another term for optical character recognition (OCR).
The working principle of OCR:
OCR software first converts the scanned image into a two-color version, classifying light areas as background and dark areas as potential characters. The dark areas are then further processed to look for alphabetic letters or numeric digits. OCR programs use a variety of techniques, but most focus on one character, word, or block of text at a time. Following that, characters are identified using one of two approaches:
Recognition of patterns: OCR programs are fed examples of text in various fonts and formats, which are then compared against and recognized in the scanned document.
Detection of features: To recognize characters in a scanned document, OCR programs apply rules based on the features of a specific letter or number. Features can include the number of angled lines, crossed lines, or curves in a character. For example, the capital letter "A" can be represented by two diagonal lines intersected by a horizontal line in the middle.
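The first step of the process described above, separating dark character areas from the light background, can be sketched as a simple threshold (illustrative only; production OCR engines use adaptive thresholding):

```python
def binarize(gray_image, threshold=128):
    # Mark dark pixels (candidate character strokes) as 1, background as 0.
    return [[1 if pixel < threshold else 0 for pixel in row]
            for row in gray_image]
```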
2.1.5 Tesseract model
Text recognition is a difficult computer-vision task with many practical applications, and optical character recognition (OCR) enables a variety of automation applications. This project focuses on word detection and recognition in natural images, a problem that is significantly more difficult than reading text in scanned documents. Despite the limited availability of images, the use case in focus makes it possible to detect the text area in natural scenes with greater accuracy; this is accomplished by mounting a camera on a truck and continuously capturing similar images. The Tesseract OCR engine is then used to recognize the detected text area.
Line Finding:
The line-finding algorithm is designed to recognize a skewed page without having to de-skew it, preserving image quality. Blob filtering and line construction are the critical steps in the process. Assuming that page layout analysis has already provided text regions of roughly uniform text size, a simple percentile height filter removes drop-caps and vertically touching characters. Since the median height approximates the text size in the region, blobs much smaller than the median height can safely be filtered out as likely punctuation, diacritical marks, and noise.
Baseline Fitting:
The blobs are partitioned into groups with a reasonably continuous displacement from the original straight baseline in order to fit the baselines. A quadratic spline is fitted to the most populous partition (assumed to be the baseline) by a least-squares fit. The quadratic spline has the advantage of being reasonably stable for this calculation, but the disadvantage that discontinuities can arise when multiple spline segments are required; a more traditional cubic spline might be preferable.
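The least-squares quadratic fit can be sketched in pure Python as a simplified stand-in for Tesseract's actual spline fitting, fitting y = a + b·x + c·x² to blob baseline points:

```python
def fit_quadratic_baseline(points):
    """Least-squares fit of y = a + b*x + c*x**2 to (x, y) blob points."""
    # Normal equations (X^T X) beta = X^T y for the basis [1, x, x^2].
    sx = [sum(x ** k for x, _ in points) for k in range(5)]
    sy = [sum(y * x ** k for x, y in points) for k in range(3)]
    A = [[float(sx[i + j]) for j in range(3)] for i in range(3)]
    b = [float(v) for v in sy]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, 3):
            f = A[row][col] / A[col][col]
            for c in range(col, 3):
                A[row][c] -= f * A[col][c]
            b[row] -= f * b[col]
    beta = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):
        beta[row] = (b[row] - sum(A[row][c] * beta[c]
                                  for c in range(row + 1, 3))) / A[row][row]
    return beta  # [a, b, c]
```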
Chopping and Fixed Pitch Detection:
Tesseract examines the text lines to determine whether they are fixed pitch. Where it finds fixed-pitch text, Tesseract chops the words into characters using the pitch, and it disables the chopper and associator on these words for the word-recognition step.
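Chopping a fixed-pitch word into character cells is straightforward once the pitch is known; a minimal sketch (hypothetical pixel coordinates):

```python
def chop_fixed_pitch(word_left, word_right, pitch):
    # Return cut positions at every pitch interval inside the word's extent.
    cuts = []
    x = word_left + pitch
    while x < word_right:
        cuts.append(x)
        x += pitch
    return cuts
```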
2.2 Barcode Technology
2.2.1 Introduction to barcode
Nowadays, automation in production and management has become a leading trend, not only within individual countries but throughout the world. The use of automatic data acquisition (ADC) technology in general, and barcode technology in particular, has brought many obvious benefits in commerce and management. One of the most apparent benefits is that inventory, payment, and export management are carried out quickly and accurately. Barcodes are more widely used than other ADC technologies because of their economic advantages and high efficiency.
Norman Joseph Woodland and Bernard Silver developed the idea of barcode technology. In 1948, as students at Drexel University, they developed the idea after learning that a food-company president wanted a way to capture product information automatically at checkout. Barcodes have since been used to check consumer goods in retail businesses and the food industry all over the world.
2.2.2 Popular types of barcodes
UPC: UPC codes are commonly used in North American countries and Canada. They are used in fields that require numeric codes only, not alphanumeric codes, and they are reliable: UPC codes include a check digit for error checking.
Code 128: Code 128 is applied in the distribution of goods in logistics and transportation, the retail supply chain, and the manufacturing industry. Code 128 is highly appreciated and well known in its applications because of advantages such as a compact barcode and diverse information storage: it can encode uppercase and lowercase letters, numeric characters, standard ASCII characters, and control codes.
EAN: The EAN code has similarities to the UPC code and is commonly used in European countries. EAN codes are used in fields where only numeric codes are required, no alphanumeric codes are needed, and they are reliable, with a check digit for error checking.
Code 39: Code 39 is used by the Ministry of National Defense, the health sector, administrative agencies, and book publishing. This type of code overcomes the biggest drawback of the EAN and UPC barcodes above: its capacity is unlimited, and it can encode uppercase characters, natural numbers, and some special characters.
2.2.3 The methods of Barcode scanning
Currently, there are many barcode-reading devices, and each device uses a different identification method, though all of them can recognize barcodes. However, no method is perfect; each has its own advantages and disadvantages.
• Advantages: low cost
• Cons: this type can only scan barcodes on flat surfaces at close range, not barcodes on curved surfaces
Laser Scanner:
Figure 2.11: Laser Scanner
Laser scanners consist of a reader that emits a red laser and then uses a reflector to create a light trail that sweeps across the surface of the barcode; they do not use a light-collecting lens.
• Advantages: no light-collecting lens needed, very sensitive laser scanning, highly accurate results, can scan barcodes on curved surfaces, and long-range scanning capability
• Cons: the reading eye is not durable and may weaken after a period of use
Read barcodes with Camera Software:
Figure 2.12: Read barcodes with Camera Software
Cameras are of great interest and are mainly used for applications that run on smartphones and for jobs that must handle multiple barcodes simultaneously. A high-resolution autofocus camera captures the input images, which are then processed by pre-programmed software.
• Advantages: gives users a more intuitive view of how barcodes in images are processed and read; a sensitive camera with good focus and accuracy; can process multiple barcodes simultaneously; highly portable and suitable for small, compact mobile devices
• Cons: affected by ambient light; the camera's resolution and focus must be high and appropriate; poor ability to read barcodes on curved surfaces; and limited reading distance
2.2.4 Code 128
Introduction to Code 128
Code 128 is a high-density symbology that encodes alphanumeric information. It offers protection against errors by means of a checksum digit and byte-parity checking. This symbology has been used extensively in applications where a large amount of data must be represented in a compact area. Its design also facilitates double-density encoding of numeric data.
Figure 2.13: Code 128
The parts of a Code 128 barcode
A Code 128 barcode consists of an initial "quiet zone," one of three start codes, the data, a check character, a stop character, and a final quiet zone.
The quiet zone is the unmarked region surrounding the bars and spaces; it allows scanners to establish baseline values for the color and reflectance of the object being scanned. These values are used to determine on the fly what constitutes a "space" and what constitutes a "bar."
The start code is one of three codes indicating the start of a Code 128 barcode. The Code 128 standard defines three "character sets" or "character modes": the Start-A, Start-B, and Start-C codes specify which character set is used. The character set may be switched within a barcode to encode data more efficiently.
Data in Code 128 are encoded in bars and spaces. In the encoding table below, a "1" indicates a single-width bar and a "0" a single-width space; sequences of ones or zeros appear as thicker bars or spaces. The actual procedure for calculating the Code 128 check digit is as follows:
Step 1: Use the value of the first (start) character (103, 104, or 105) as the initial value of the running checksum.