HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
DEPARTMENT OF COMPUTER AND COMMUNICATIONS
Nguyễn Hoài Phương Uyên
Ho Chi Minh City, Sunday, November 28, 2021
TOPIC: FACE MASK DETECTION
MAJOR: COMPUTER ENGINEERING
Group 10:
18119053
Nguyễn Hoài Phương Uyên
Supervisor: Trương Ngọc Sơn, PhD
INSTRUCTOR'S COMMENT TABLE
General comment:
………
………
………
………
SUMMARY
CONTENTS
LIST OF PICTURES
LIST OF TABLES
ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1 Introduction
1.2 Topic goal
1.3 Topic limit
1.4 Research Method
1.5 Object and Scope of Study
1.6 Report book layout
CHAPTER 2: THEORY
2.1 Overview
2.2 Architecture of YOLO
2.2 YOLO's output
2.2.1 Predict on feature map
2.2.2 Anchor Box
2.2.3 Loss Function
2.3 Prediction on the bounding box
2.3.1 Non-max suppression
2.4 YOLOv5 Architecture
2.5 Face Mask Detection
CHAPTER 3: DESIGN SOFTWARE
3.1 THE ACTIVE FUNCTION OF SOFTWARE
3.1.1 Data Collection
3.2 The training processing
3.2.1 Start training processing
CHAPTER 4: RESULTS
CHAPTER 5: CONCLUSION AND DEVELOPMENTS
5.1 CONCLUSION
5.2 DEVELOPMENTS
APPENDIX
REFERENCES
LIST OF PICTURES
Image 2.1: YOLO's architecture
Image 2.2: The layers in the Darknet-53 network
Image 2.3: How YOLO works
Image 2.4: The architecture of YOLO's output
Image 2.5: Some feature maps in YOLOv3: with a 416x416 input, the output feature maps are 13x13, 26x26, and 52x52
Image 2.6: Identifying the anchor box of an object
Image 2.7: How the algorithm decides the class for a cell
Image 2.8: The formula for estimating a bounding box from an anchor box
Image 2.9: Non-max suppression: 3 initial bounding boxes are reduced to 1 bounding box
Image 3.1: Using roboflow.ai to create a dataset and apply augmentation
Image 3.2: Cloning the repository and setting up all dependencies for YOLOv5
Image 3.3 + 3.4: Using a URL path to link directly to the dataset in roboflow.ai
Image 3.5: The dataset is stored in the content folder
Image 3.6: The data.yaml file
Image 3.7: Downloading the model to train
Image 3.8: The training process
Image 3.9: Displaying results after the training process
Image 3.10: The detection process
Image 4.1: Results of the training process
Image 4.2: Results of the detection process
LIST OF TABLES
ABBREVIATIONS
1 CNN: Convolutional Neural Network
2 ReLU: Rectified Linear Unit
3 YOLO: You Only Look Once
4 SSD: Single Shot MultiBox Detector
5 IoU: Intersection over Union
6 CSPNet: Cross Stage Partial Network
7 PANet: Path Aggregation Network
8 FPN: Feature Pyramid Network
9 OpenCV: Open Source Computer Vision Library
CHAPTER 1: INTRODUCTION
1.1 Introduction
On March 11, 2020, the World Health Organization (WHO) declared COVID-19 a global pandemic. To slow the rapid spread of the disease, and in line with WHO's recommendation to wear masks in crowded places, the Government of Vietnam has required people to wear masks in public areas. However, monitoring compliance with the Government's instructions by traditional methods is difficult and expensive because resources are limited. To support monitoring and help remind people, our team will build a program that automatically detects people not wearing masks in real time.
Today, artificial intelligence (AI) is increasingly popular and is profoundly changing many aspects of daily life. Computer vision (CV) is an important area of AI that covers acquiring, processing, analyzing, and recognizing digital images. Deep learning is the study of algorithms and computer programs, based on deep neural networks, that allow computers to learn and make predictions in a way loosely similar to humans.
It is applied in many areas, including science, engineering, and other fields of life, as well as in classification and object detection applications. A typical example is the CNN (Convolutional Neural Network), which is applied to automatic recognition and learns distinguishing patterns from images by stacking layers successively on top of each other. In many applications, CNNs are now considered powerful image classifiers and underpin machine-learning-based techniques in computer vision. However, CNNs also consume many resources, such as bandwidth, memory, and hardware processing capacity, to classify an object.
To reduce this resource consumption, a growing number of algorithms and models have been introduced over time, including the YOLOv5 model for the recognition problem, which is applied here to the topic "Face mask detection."
1.2 Topic goal
Apply basic knowledge of the process of training neural networks. Understand the theoretical and architectural basis of the YOLOv5 model for the object recognition problem.
Build a model that can be trained on different face mask detection datasets (Kaggle's face mask detection dataset and a self-generated face mask detection dataset). Recognize faces with and without a mask.
1.4 Research Method
Build on the knowledge learned about training a neural network.
Collect documents and refer to previous related applications.
Consult and follow the instructor's guidance.
1.5 Object and Scope of Study
Identify people who are wearing masks and people who are not wearing masks in the dataset.
1.6 Report book layout
The thesis has a total of 5 chapters:
• Chapter 1 - Overview
This chapter presents the issues that motivate the topic, together with the content and limitations that the project team has set.
• Chapter 2 - Theoretical Basis
An introduction to the background knowledge, technology, and software used in the project, including image processing, neural network theory, the characteristics of YOLOv5, and how to train it on a dataset.
• Chapter 3 - System Design
The plan for using the sample set, an interpretation of the model's parameters, the training process, and the process of testing a face mask recognition system on the YOLOv5 platform.
• Chapter 4 - Results
The results of the training process and the recognition process.
• Chapter 5 - Conclusion and development direction
This chapter presents the project results achieved compared with the set objectives and points out some research and development directions for the topic.
CHAPTER 2: THEORY
In recent years, object detection has become one of the most popular deep learning topics because of its wide applicability and the ease of preparing data. Recent object detection algorithms, such as YOLO and SSD, are fast and accurate, allowing objects to be detected in real time, in some cases faster than humans, without sacrificing accuracy. Models have also become lighter, so they can run on IoT devices to create intelligent machines.
2.1 Overview
YOLO (You Only Look Once) is a CNN model for detecting, classifying, and recognizing objects. It is built from a combination of convolutional layers and fully connected layers. The convolutional layers extract features from an image, while the fully connected layers predict the probabilities and coordinates of objects.
Although YOLO is not the most accurate algorithm, it is among the fastest object detection models. It can achieve near real-time speeds, while its accuracy is not significantly lower than that of the top models.
Because YOLO is an object detection technique, the model's purpose is both to predict labels for objects, as in classification tasks, and to locate each object. As a result, YOLO can detect many objects with different labels in one image instead of assigning a single label to the whole image.
One of the benefits of YOLO is that it looks at the whole image only once, predicting all bounding boxes containing objects in a single pass. Because the model is built end-to-end, it can be trained entirely by gradient descent. YOLO has had five versions so far (v1, v2, v3, v4, v5). The latest version, v5, addresses disadvantages of previous versions, such as errors in estimating the position of objects and limitations due to spatial constraints on bounding boxes, where each grid cell can predict only a small number of bounding boxes.
2.2 Architecture of YOLO
The architecture of YOLO consists of two parts. The base network is made up of convolution layers that extract features. The following part consists of extra layers that are applied to detect objects on the feature map produced by the base network.
The base network of YOLO is primarily made up of convolutional layers and fully connected layers. YOLO architectures are also very flexible and can be customized to fit a variety of input shapes.
Image 2.1: YOLO's architecture
The base network component of the Darknet architecture performs feature extraction. The extra layers, which predict the object's label and bounding box coordinates, take the base network's output, a feature map of size 7x7x1024, as input.
In the third version of YOLO, i.e., YOLOv3, the authors use a feature extractor network called Darknet-53. This network is made up of 53 consecutive convolutional layers, each followed by batch normalization and a Leaky ReLU activation. To reduce the output size, the authors downsample with convolutions that use a stride of 2. This technique aims to reduce the number of parameters in the model.
Image 2.2: The layers in the Darknet-53 network
When images are fed into the model, they are scaled to a common size that matches the model's input shape and are then gathered into a batch for training.
YOLO currently supports two main input sizes: 416x416 and 608x608. Each input size has its own layer design that matches the input shape. The spatial size is halved repeatedly as the input passes through the convolution layers. Finally, a relatively small feature map is produced, and objects are predicted on each cell of that feature map.
The size of the feature maps depends on the input. For a 416x416 input, the feature maps have dimensions 13x13, 26x26, and 52x52. For a 608x608 input, the feature maps are 19x19, 38x38, and 76x76.
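The relation between input size and feature map size can be checked with a short calculation. The sketch below assumes the three detection scales downsample the input by factors of 32, 16, and 8, which reproduces the 13x13, 26x26, and 52x52 maps for a 416x416 input. The function name `feature_map_sizes` is hypothetical and used only for illustration.

```python
# Assumed downsampling factors of the three detection scales (illustrative).
STRIDES = (32, 16, 8)


def feature_map_sizes(input_size, strides=STRIDES):
    """Return the side length of each square feature map for a square input."""
    if any(input_size % s for s in strides):
        raise ValueError("input size must be divisible by every stride")
    return [input_size // s for s in strides]


print(feature_map_sizes(416))  # [13, 26, 52]
print(feature_map_sizes(608))  # [19, 38, 76]
```

Because every scale divides the input by a fixed factor, any input size that is a multiple of 32 yields whole-number feature map sizes.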
2.1.2 The operating principle of YOLOv5
The model's input is an image. The model identifies whether the image contains any objects and then determines the coordinates of each object in the image. The input image is divided into SxS cells, usually 3x3, 7x7, 9x9,… Does the cell division affect the object detection of the model?
Image 2.3: How YOLO works
Given an image as input, the model outputs a 3-D matrix of size SxSx(5 x N + M), that is, (5 x N + M) parameters per cell, where N and M are the numbers of boxes and classes that each cell must predict, respectively. Consider the example in the image above, which is divided into 7x7 cells. Each cell must predict two bounding boxes and three object classes: a dog, a car, and a bicycle. The output will be 7x7x13, with 13 parameters for each cell, yielding 7x7x2 = 98 bounding boxes in total.
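The arithmetic of this example can be reproduced in a few lines. This is a minimal sketch of the SxSx(5 x N + M) formula only, not of the model itself. The function names are hypothetical.

```python
def yolo_output_shape(grid_size, boxes_per_cell, num_classes):
    """Shape of the S x S x (5*N + M) output tensor."""
    per_cell = 5 * boxes_per_cell + num_classes
    return (grid_size, grid_size, per_cell)


def total_boxes(grid_size, boxes_per_cell):
    """Number of bounding boxes predicted over the whole grid."""
    return grid_size * grid_size * boxes_per_cell


# The example from the text: 7x7 grid, 2 boxes per cell, 3 classes.
print(yolo_output_shape(7, 2, 3))  # (7, 7, 13)
print(total_boxes(7, 2))           # 98
```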
2.2 YOLO's output
The output of the YOLO model is a vector that will include the following components:
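$$y^T = [\,p_0,\ \underbrace{t_x, t_y, t_w, t_h}_{\text{bounding box}},\ \underbrace{p_1, p_2, \ldots, p_c}_{\text{scores of } c \text{ classes}}\,]$$
Where:
• $p_0$ is the predicted probability that an object appears in the bounding box.
• $t_x, t_y, t_w, t_h$ help define the bounding box, where $t_x, t_y$ are the coordinates of the center and $t_w, t_h$ are the width and height of the bounding box.
• $p_1, p_2, \ldots, p_c$ are the predicted scores of the $c$ classes.
With 80 classes and 3 anchor boxes per cell, the number of output parameters per cell is:
$$(n_{class} + 5) \times 3 = 85 \times 3 = 255$$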
In the figure, the original image is represented by a 13x13 feature map. On each cell of the feature map, we select three anchor boxes of different sizes, Box 1, Box 2, and Box 3, so that the center of each anchor box coincides with the cell. The output of YOLO for that cell is then a concatenated vector of 3 bounding boxes. The attributes of a bounding box are described in the last line of the figure.
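To make the concatenated per-cell vector concrete, the sketch below splits a flat vector of length (n_class + 5) x 3 into its three anchor box parts. It assumes the ordering used in the vector above (objectness first, then the four box values, then the class scores). Real implementations may order these fields differently, and the function name `split_cell_prediction` is hypothetical.

```python
def split_cell_prediction(vector, num_anchors, num_classes):
    """Split one cell's flat output into one dict per anchor box.

    Assumed layout per anchor: [p0, tx, ty, tw, th, p1, ..., pc].
    """
    per_anchor = num_classes + 5
    if len(vector) != num_anchors * per_anchor:
        raise ValueError("vector length does not match (num_classes + 5) * num_anchors")
    parts = []
    for a in range(num_anchors):
        chunk = vector[a * per_anchor:(a + 1) * per_anchor]
        parts.append({
            "objectness": chunk[0],     # p0
            "box": chunk[1:5],          # tx, ty, tw, th
            "class_scores": chunk[5:],  # p1 ... pc
        })
    return parts


# A dummy cell output for 80 classes and 3 anchors: 85 * 3 = 255 values.
cell = [0.0] * 255
parts = split_cell_prediction(cell, num_anchors=3, num_classes=80)
print(len(parts), len(parts[0]["box"]), len(parts[0]["class_scores"]))  # 3 4 80
```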
2.2.1 Predict on feature map
Like SSD, YOLO (more precisely, YOLOv3) makes predictions on several feature maps. The first, small feature maps are used to predict large objects. The later feature maps are larger, while the anchor boxes are kept at a fixed size, which helps predict small objects.
Image 2.5: Some feature maps in YOLOv3: with a 416x416 input, the output feature maps are 13x13, 26x26, and 52x52
We use three anchor boxes to predict objects in each cell of the feature maps. As a result, a YOLO model has 9 different anchor boxes (3 feature maps x 3 anchor boxes).
At the same time, on an SxS square feature map, the YOLOv3 model generates SxSx3 anchor boxes. The number of anchor boxes on an image is therefore:
(13x13 + 26x26 + 52x52) x 3 = 10647 (anchor boxes)
This is a huge number, and it is the reason why training the YOLO model is extremely slow: we need to predict labels and bounding boxes for 10647 bounding boxes simultaneously. Some notes when training YOLO:
• Training YOLO requires more RAM, because all 10647 bounding boxes of this architecture must be held in memory.
• Unlike in classification models, the batch size cannot be set too large, because it is easy to run out of memory. To fit in RAM, YOLO's darknet package divides a batch into subdivisions.
• Processing one step in YOLO takes many times longer than in classification models. It is therefore advisable to keep the limit on training steps small. For problems with fewer than 5 classes, fewer than 5000 steps can give an acceptable result. Models with more classes may need considerably more steps, depending on the user.
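The anchor box count and the effect of subdivisions can both be checked with simple arithmetic. The sketch below assumes three anchor boxes per cell and that a batch is split evenly across subdivisions, so each forward pass handles batch / subdivisions images. The values `batch=64` and `subdivisions=16` are illustrative assumptions, not settings taken from this report, and both function names are hypothetical.

```python
def total_anchor_boxes(feature_map_sizes, anchors_per_cell=3):
    """Sum S*S over all square feature maps, times the anchors per cell."""
    return sum(s * s for s in feature_map_sizes) * anchors_per_cell


def images_per_pass(batch, subdivisions):
    """Images processed at once when a batch is split into subdivisions."""
    if batch % subdivisions:
        raise ValueError("batch must be divisible by subdivisions")
    return batch // subdivisions


print(total_anchor_boxes([13, 26, 52]))  # 10647 (416x416 input)
print(total_anchor_boxes([19, 38, 76]))  # 22743 (608x608 input)
print(images_per_pass(64, 16))           # 4
```

The larger input roughly doubles the number of boxes to predict, which is one reason the smaller input size and a higher subdivision count are easier to fit in memory.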
2.2.2 Anchor Box
To find the bounding box for an object, YOLO needs anchor boxes as the basis of the estimation. These anchor boxes are predefined and surround the object relatively accurately. Later, the bounding box regression algorithm refines the anchor box to create a predicted bounding box for the object. In a YOLO model:
• Each object in a training image is assigned to one anchor box. If two or more anchor boxes surround the object, we choose the anchor box that has the highest IoU with the ground truth bounding box.