Research report. Topic: Face Mask Detection. Major: Computer Engineering


HCM UNIVERSITY OF TECHNOLOGY AND EDUCATION

FACULTY FOR HIGH-QUALITY TRAINING

DEPARTMENT OF COMPUTER AND COMMUNICATIONS

Nguyễn Hoài Phương Uyên

Ho Chi Minh City, Sunday, November 28, 2021


TOPIC: FACE MASK DETECTION

MAJOR: COMPUTER ENGINEERING

Group 10:

18119053

Nguyễn Hoài Phương Uyên

Supervisor: Trương Ngọc Sơn, PhD


INSTRUCTOR'S COMMENT TABLE

General comment:

………


SUMMARY


CONTENT

LIST OF PICTURES
LIST OF TABLES
ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1 Introduction
1.3 Topic limit
1.4 Research Method
1.5 Object and Scope of Study
1.6 Report book layout
CHAPTER 2: THEORY
2.1 Overview
2.2 Architecture of YOLO
2.2 YOLO's output
2.2.1 Predict on feature map
2.2.2 Anchor Box
2.2.3 Loss Function
2.3 Prediction on the bounding box
2.3.1 Non-max suppression
2.4 YOLOv5 Architecture
2.5 Face Mask Detection
CHAPTER 3: DESIGN SOFTWARE
3.1 THE ACTIVE FUNCTION OF SOFTWARE
3.1.1 Data Collection
3.2 The training processing
3.2.1 Start training processing
CHAPTER 4: RESULTS
CHAPTER 5: CONCLUSION AND DEVELOPMENTS
5.1 CONCLUSION
5.2 DEVELOPMENTS
APPENDIX
REFERENCES


LIST OF PICTURES

Image 2.1: YOLO's architecture
Image 2.2: The layers in the Darknet-53 network
Image 2.3: How YOLO operates
Image 2.4: The architecture of YOLO's output
Image 2.5: Some feature maps in YOLOv3 with a 416x416 input; the output feature maps are 13x13, 26x26, and 52x52
Image 2.6: Identifying the anchor box of an object
Image 2.7: How the algorithm decides the class for a cell
Image 2.8: The formula for estimating the bounding box from the anchor box
Image 2.9: Non-max suppression: 3 initial bounding boxes are reduced to 1 bounding box
Image 3.1: Using roboflow.ai to create a dataset and apply augmentation
Image 3.2: Cloning the repository and setting up all dependencies in YOLOv5
Image 3.3 and Image 3.4: Using the URL path to link directly to the dataset in roboflow.ai
Image 3.5: The dataset is contained in the content folder
Image 3.6: The data.yaml file
Image 3.7: Downloading the model to train
Image 3.8: The training process
Image 3.9: Displaying results after the training process
Image 3.10: The detection process
Image 4.1: Results of the training process
Image 4.2: Results of the detection process


LIST OF TABLES


ABBREVIATIONS

1 CNN: Convolutional Neural Network
2 ReLU: Rectified Linear Unit
3 YOLO: You Only Look Once
4 SSD: Single Shot Detector
5 IoU: Intersection over Union
6 CSPNet: Cross Stage Partial Network
7 PANet: Path Aggregation Network
8 FPN: Feature Pyramid Network
9 OpenCV: Open Source Computer Vision Library

CHAPTER 1: INTRODUCTION

1.1 Introduction

On March 11, 2020, the World Health Organization (WHO) issued a statement declaring COVID-19 a global pandemic. To slow the rapid spread of the pandemic, in addition to the WHO's recommendation to wear masks in crowded places, the Government of Vietnam has required people to wear masks in public areas. However, monitoring compliance with the Government's instructions using traditional methods is challenging and expensive because of the lack of resources. To support and improve the monitoring and reminding of people, our team will build a program that automatically detects people not wearing masks in real time.

Today, artificial intelligence (AI) is increasingly popular and is profoundly changing many aspects of daily life. Computer vision (CV) is an important area of AI that covers acquiring and processing digital images as well as analyzing and recognizing them. Deep learning is a field that studies algorithms and computer programs that allow computers to learn and make predictions as humans do.

Deep learning is applied in many areas, such as science, engineering, and other fields of life, and in object classification and detection applications. A typical example is the CNN (Convolutional Neural Network), which is applied to automatic recognition and learns distinguishing patterns from images by stacking layers successively on top of each other. In many applications, CNNs are now regarded as powerful image classifiers and underpin machine-learning-based computer vision technologies. However, CNNs consume many resources, such as bandwidth, memory, and hardware processing capacity, to classify an object.

To reduce this resource consumption, more and more algorithms and models have been introduced over time, including the YOLOv5 model for the recognition problem, which is applied here to the topic "Face mask detection."

1.2 Topic goal

Apply basic knowledge about the process of training neural networks. Understand the theoretical and architectural basis of the YOLOv5 model for the object recognition problem.

Build a model that can be trained on different face mask detection datasets (Kaggle's face mask detection dataset and a self-generated face mask detection dataset). Recognize faces with and without a mask.

1.4 Research Method

Build on the knowledge learned about training a neural network.

Collect documents and refer to previous related applications.

Consult and follow the instructor's guidance.

1.5 Object and Scope of Study

Identify people who are wearing masks and people who are not wearing masks in the dataset.

1.6 Report book layout

The thesis has a total of 5 chapters:

• Chapter 1 - Overview

This chapter presents the issues that motivate the topic, together with the contents and limitations of the topic that the project team has set.

• Chapter 2 - Theoretical Basis

An introduction to the background knowledge, technology, and software used in the project, including image processing, neural network theory, and the characteristics of YOLOv5 and how to train it on a dataset.

• Chapter 3 - System Design

Plan the use of the sample set, and explain the model's parameters, the training process, and the process of testing a face mask recognition system on the YOLOv5 platform.

• Chapter 4 - Results

Check the results of the training process and the recognition process.

• Chapter 5 - Conclusion and development direction

This chapter presents the project results achieved in comparison with the set objectives and points out some research and development directions for the topic.


CHAPTER 2: THEORY

In recent years, object detection has become one of the most popular deep learning topics because of its high practical value, the ease of preparing data, and its wide applicability. Recent object detection algorithms, such as YOLO and SSD, are fast and accurate, allowing objects to be detected in real time, even faster than humans can, without sacrificing accuracy. Models are also becoming lighter, so they can run on IoT devices to create intelligent machines.

2.1 Overview

YOLO (You Only Look Once) is a CNN model for detecting, classifying, and recognizing objects. YOLO is built by combining convolutional layers and fully connected layers. The convolutional layers extract features from the image, while the fully connected layers predict the class probabilities and the coordinates of the objects.

Although YOLO is not the most accurate algorithm, it is among the fastest object detection models. It can achieve near real-time speeds while its accuracy is not significantly lower than that of the top models.

Because YOLO is an object detection technique, the model both predicts a label for each object, as in a classification task, and locates the object in the image. As a result, YOLO can detect multiple objects with different labels in a single image instead of assigning a single label to the whole image.

One of the benefits of YOLO is that it looks at the whole image only once and predicts all of the bounding boxes containing the objects. Because the model is built end-to-end, it can be trained entirely by gradient descent. At the time of writing, YOLO has had five versions (v1, v2, v3, v4, v5). The latest version, v5, addresses disadvantages of earlier versions, such as errors in estimating the position of objects and limitations due to spatial constraints on bounding boxes, where each grid cell can only predict a small number of bounding boxes.


2.2 Architecture of YOLO

The YOLO architecture consists of two parts. The base network is made up of the convolutional layers that extract features. It is followed by the extra layers, which are applied to detect objects on the feature map produced by the base network.

The base network of YOLO is primarily made up of convolutional layers and fully connected layers. YOLO architectures are also very flexible and can be customized to fit a variety of input shapes.

Image 2.1: YOLO's architecture

The base network component, the Darknet architecture, performs feature extraction. The output of the base network, a feature map of size 7x7x1024, is used as the input to the extra layers, which predict the object's label and bounding box coordinates.

In the third version of YOLO, i.e., YOLOv3, the author uses a feature extractor network called Darknet-53. This network is made up of 53 consecutively connected convolutional layers, each followed by batch normalization and a Leaky ReLU activation. To reduce the output size after the convolution layers, the author downsamples with filters applied at a stride of 2. This technique aims to reduce the number of parameters in the model.
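The two building blocks mentioned above, the Leaky ReLU activation and downsampling by a stride of 2, can be illustrated with a minimal sketch. The negative slope of 0.1, the 3x3 filter, and the padding of 1 are assumptions chosen for illustration, not values stated in this report.

```python
def leaky_relu(x: float, negative_slope: float = 0.1) -> float:
    """Leaky ReLU: keep positive values, scale negative values by a small slope.

    The slope 0.1 is an assumed default for this sketch.
    """
    return x if x > 0 else negative_slope * x


def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed example: a 3x3 filter with padding 1 and stride 2
# halves the side of a 416-pixel input.
halved = conv_output_size(416, kernel=3, stride=2, padding=1)
```

With these assumed settings, each stride-2 convolution halves the spatial size, which is how the feature maps shrink as the image passes through the network.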

Image 2.2: The layers in the Darknet-53 network

When images are fed to the model, they are scaled to a common size that matches the model's input shape, and they are then collected into a batch for training.

YOLO currently supports two main input sizes: 416x416 and 608x608. Each input size has its own layer design that corresponds to the input's shape. After passing through the convolution layers, the shape shrinks by successive factors of 2. Finally, a small feature map is obtained, and objects are predicted on each cell of that feature map.

The input size determines the size of the feature maps. For a 416x416 input, the feature maps have sizes 13x13, 26x26, and 52x52. For a 608x608 input, the feature maps are 19x19, 38x38, and 76x76.
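The relation between input size and feature map size can be checked with a short sketch. It assumes that the three prediction scales correspond to total downsampling factors (strides) of 32, 16, and 8, which is consistent with the 416x416 figures above; the function name is hypothetical.

```python
def feature_map_sizes(input_size: int, strides=(32, 16, 8)) -> list[int]:
    """Side length of each square feature map for a square input.

    The strides (32, 16, 8) are an assumption consistent with
    416 -> 13, 26, 52 as quoted in the text.
    """
    if any(input_size % s for s in strides):
        raise ValueError("input size must be divisible by every stride")
    return [input_size // s for s in strides]


sizes_416 = feature_map_sizes(416)  # [13, 26, 52]
sizes_608 = feature_map_sizes(608)  # [19, 38, 76]
```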

2.1.2 The operating principle of YOLOv5

The model's input is an image. The model identifies whether the image contains any objects and then determines the coordinates of each object in the image. The input image is divided into SxS cells, usually 3x3, 7x7, 9x9, … The way the image is divided into cells affects the model's object detection.

Image 2.3: How YOLO operates

Given an input image, the model outputs a 3-D matrix of size SxSx(5 x N + M), where (5 x N + M) is the number of parameters per cell, and N and M are the numbers of boxes and classes that each cell must predict, respectively. Consider the example in the image above, which is divided into 7x7 cells. Each cell must predict two bounding boxes and three object classes: a dog, a car, and a bicycle. The output is therefore 7x7x13, with 13 parameters for each cell, yielding a total of 7x7x2 = 98 bounding boxes.
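The arithmetic of this example can be reproduced with a small sketch. The function name is hypothetical; the formula SxSx(5 x N + M) is the one given in the text.

```python
def grid_output(s: int, n_boxes: int, n_classes: int):
    """Return the output shape S x S x (5N + M) and the total number of boxes."""
    per_cell = 5 * n_boxes + n_classes  # 5 values per box plus M class scores
    shape = (s, s, per_cell)
    total_boxes = s * s * n_boxes
    return shape, total_boxes


# The 7x7 example: two boxes and three classes per cell.
shape, total_boxes = grid_output(7, n_boxes=2, n_classes=3)
# shape == (7, 7, 13), total_boxes == 98
```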

2.2 YOLO's output

The output of the YOLO model is a vector that will include the following components:

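$$y^T = [\,p_0,\ \underbrace{t_x, t_y, t_w, t_h}_{\text{bounding box}},\ \underbrace{p_1, p_2, \ldots, p_c}_{\text{scores of } c \text{ classes}}\,]$$

$p_0$: the predicted probability that an object is present in the bounding box.

$t_x, t_y, t_w, t_h$: help define the bounding box, where $t_x, t_y$ are the coordinates of the center and $t_w, t_h$ are the width and height of the bounding box.

$p_1, p_2, \ldots, p_c$: the predicted scores of the $c$ classes.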

With 80 classes and 3 anchor boxes per cell, the number of output parameters per cell is (n_class + 5) x 3 = 85 x 3 = 255.

In the figure, the original image is represented as a 13x13 feature map. On each cell of the feature map, we select three anchor boxes of different sizes, Box 1, Box 2, and Box 3, such that the center of each anchor box coincides with the cell. The output of YOLO is then a vector that concatenates the 3 bounding boxes. The attributes of a bounding box are described in the last line of the figure.
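To make the concatenated per-cell vector concrete, the sketch below splits a flat vector of (n_class + 5) x 3 values into three boxes. It assumes the component order of the output vector given in this section (object probability first, then the four box values, then the class scores); real implementations may order the values differently, and the function and key names are hypothetical.

```python
def split_cell_vector(values, n_classes: int, n_anchors: int = 3):
    """Split one cell's flat output into per-anchor attribute groups.

    Assumed layout per anchor: [p0, t_x, t_y, t_w, t_h, p_1, ..., p_c].
    """
    per_box = n_classes + 5
    if len(values) != per_box * n_anchors:
        raise ValueError("unexpected vector length")
    boxes = []
    for a in range(n_anchors):
        chunk = values[a * per_box:(a + 1) * per_box]
        boxes.append({
            "p0": chunk[0],                  # object probability
            "box": tuple(chunk[1:5]),        # t_x, t_y, t_w, t_h
            "class_scores": list(chunk[5:])  # one score per class
        })
    return boxes


# 80 classes and 3 anchors give a vector of 255 values per cell.
cell = list(range(255))
boxes = split_cell_vector(cell, n_classes=80)
```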

2.2.1 Predict on feature map

Like SSD, YOLO (more specifically, YOLOv3) makes predictions on several feature maps. The first feature maps are small and help predict large objects. The later feature maps are larger while the anchor boxes are kept at a fixed size, which helps predict small objects.

Image 2.5: Some feature maps in YOLOv3 with a 416x416 input; the output feature maps are 13x13, 26x26, and 52x52

On each cell of the feature maps, three anchor boxes are used to predict objects. As a result, a YOLO model has 9 different anchor boxes in total (3 feature maps x 3 anchor boxes).

In addition, on a square feature map of size SxS, the YOLOv3 model generates SxSx3 anchor boxes. The number of anchor boxes on an image is therefore:

(13x13 + 26x26 + 52x52) x 3 = 10647 (anchor boxes)
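The total can be verified with a one-line computation. The sketch below sums the cells of the three feature maps and multiplies by the three anchor boxes per cell; the function name is hypothetical.

```python
def total_anchor_boxes(feature_map_sizes, anchors_per_cell: int = 3) -> int:
    """Total anchor boxes over all square feature maps."""
    return sum(s * s for s in feature_map_sizes) * anchors_per_cell


total = total_anchor_boxes([13, 26, 52])  # (169 + 676 + 2704) * 3 = 10647
```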

A total of 10647 anchor boxes is a huge number, and it is the reason why training the YOLO model is extremely slow: labels and bounding boxes must be predicted for 10647 bounding boxes simultaneously. Some notes when training YOLO:

• Training YOLO requires more RAM in order to store the 10647 bounding boxes of this architecture.

• Unlike in classification models, the batch size cannot be set too large because it is easy to run out of memory. To fit in RAM, YOLO's darknet package divides a batch into subdivisions.

• A training step of YOLO takes many times longer than a step of a classification model. It is therefore recommended to set a small limit on the number of training steps for YOLO. For problems with fewer than 5 classes, a short run of fewer than 5000 steps is acceptable. Models with more classes may need significantly more steps, at the user's discretion.
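The idea of dividing a batch into subdivisions can be sketched as follows. This is an illustration of the concept only, not Darknet's actual code; the batch size of 64 and the 16 subdivisions are assumed example values, and the function name is hypothetical.

```python
def split_into_subdivisions(batch, subdivisions: int):
    """Split a batch into equally sized chunks that are processed one at a time."""
    if subdivisions <= 0 or not batch or len(batch) % subdivisions:
        raise ValueError("batch size must be a positive multiple of subdivisions")
    size = len(batch) // subdivisions
    return [batch[i:i + size] for i in range(0, len(batch), size)]


# Assumed example: 64 image indices processed as 16 chunks of 4 images each,
# so only 4 images need to be held in memory at once.
chunks = split_into_subdivisions(list(range(64)), subdivisions=16)
```

Smaller chunks lower peak memory use at the cost of more passes per batch, which is the trade-off behind the second note above.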

2.2.2 Anchor Box

To find the bounding box for an object, YOLO needs anchor boxes as the basis for the estimate. These anchor boxes are predefined and surround the object relatively accurately. Later, the bounding box regression algorithm refines the anchor box to create a predicted bounding box for the object. In a YOLO model:

• Each object in the training image is assigned to one anchor box. When two or more anchor boxes surround the object, we choose the anchor box that has the highest IoU with the ground truth bounding box.
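The selection rule in the bullet above can be sketched in a few lines. The sketch assumes boxes are given as corner coordinates (x_min, y_min, x_max, y_max), a representation chosen for illustration and not specified in this report; the function names and example boxes are hypothetical.

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_anchor(ground_truth, anchors):
    """Index and IoU of the anchor box that overlaps the ground truth the most."""
    scores = [iou(ground_truth, a) for a in anchors]
    best = max(range(len(anchors)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical ground truth box and three candidate anchor boxes.
gt = (0, 0, 10, 10)
anchors = [(0, 0, 5, 5), (0, 0, 10, 8), (5, 5, 15, 15)]
index, score = best_anchor(gt, anchors)  # the second anchor wins with IoU 0.8
```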

Title	Face Mask Detection
Authors	Phạm Minh Quân, Nguyễn Hoài Phương Uyên
Supervisor	Trương Ngọc Sơn, PhD
Institution	HCM University of Technology and Education
Major	Computer Engineering
Type	Research report
Year of publication	2021
City	Ho Chi Minh City
