
Student research report: Developing Automatic Surveillance System For Personal Protective Equipment Compliance In Real Construction Sites Using Deep Neural Network Models


DOCUMENT INFORMATION

Basic information

Title: Developing Automatic Surveillance System For Personal Protective Equipment Compliance In Real Construction Sites Using Deep Neural Network Models
Author: Nguyen Duy Thuc
Supervisors: Kim Dinh Thai, Ha Manh Hung
Institution: Vietnam National University, Hanoi
Major: Business Data Analytics
Document type: Student Research Report
Year: 2024
City: Hanoi
Pages: 51
Size: 25.81 MB


Structure

  • 1. Research Topic
  • 2. Student's Information
  • 1. Concerning Rationale of the Study
  • 2. Research Questions
  • 3. Object and Scope of the Study
  • CHAPTER 1: OVERVIEW OF DEEP NEURAL NETWORK MODELS
    • 1. Deep Neural Network
    • 2. Convolutional Neural Network
    • 3. Algorithms for Improving CNNs
      • 3.1. Dropout
      • 3.2. Batch Normalization
      • 3.3. Data Augmentation
      • 3.4. Transfer Learning
      • 3.5. Learning Rate Scheduling
  • CHAPTER 2: OBJECT DETECTION AND YOLO ARCHITECTURE
    • 1.2. Object Detection
    • 1.3. Segmentation
    • 2. Object Detection Problem
      • 2.1. Challenges
      • 2.2. Models for Solving Object Detection
      • 2.3. Overview of Ultralytics YOLO Models
      • 2.4. SAHI for Sliced Inference
  • CHAPTER 3: PPE DETECTION BASED ON YOLO MODELS
  • CHAPTER 4: RESULTS
    • 1. Metrics for Performance Evaluation
    • 2. Quantitative Analysis
    • 3. Qualitative Analysis
    • 4. Enhancing Model Efficiency with SAHI
    • 5. Developing Automatic Surveillance System using Streamlit

Content

VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL

STUDENT RESEARCH REPORT

Developing Automatic Surveillance System for Personal Protective Equipment Compliance in Real Construction Sites using Deep Neural Network Models

Team Leader: Nguyen Duy Thuc

1. Research Topic

English: Developing Automatic Surveillance System for Personal Protective Equipment Compliance in Real Construction Sites using Deep Neural Network Models

Vietnamese: Phát triển hệ thống tự động giám sát việc tuân thủ thiết bị bảo hộ lao động cá nhân tại các công trường thực tế bằng cách sử dụng mô hình mạng nơ-ron sâu

2. Student's Information

Name                    | Student ID | Class    | Program | Year
Nguyễn Duy Thức         | 21070278   | BDA2021A | BDA     | 3rd
Nguyễn Mạnh Trường Lâm  | 21070570   | BDA2021A | BDA     | 3rd
Nguyễn Lê Quang Hiếu    | 21070069   | BDA2021A | BDA     | 3rd
Trần Minh Tuấn Kiệt     | 21070436   | ICE2021B | ICE     | 3rd
Nguyễn Tuấn Minh        | 23070471   | AIT2023B | AIT     | 1st

1. Concerning Rationale of the Study

In the first 6 months of 2023, the growth rate of the construction industry increased by 4.74% over the same period in 2022, higher than the GDP growth rate of the whole country. Accounting for 37.12% of total GDP in Vietnam in 2023 (~159.62 billion USD) and employing 17.2 million people (~32.8% of the total workforce), an increase of 248.2 thousand people, or 1.5%, compared to the same period in 2022, the construction industry is one of the largest sectors of the Vietnamese economy [1]. Despite its significant contribution to the economy, the construction industry has also gained a reputation as one of the most perilous industries due to the alarming rate of workplace accidents and worker injuries.

Table 1. Statistics of accidents on construction sites in Vietnam

No. | Statistical details                    | First 6 months of 2022 | First 6 months of 2023 | Increase (+) / Decrease (-)
3   | Number of fatal occupational accidents | 292                    | 273                    | -19 (-6.02%)
5   | Number of seriously injured people     | 689                    | 715                    | +26 (+3.77%)

According to reports from all 63 provinces and cities, in the first 6 months of 2023 there were 3,201 occupational accidents across the country (a decrease of 707 cases, or 18.09%, compared to the first 6 months of 2022), causing more than 3,262 people to be injured (a decrease of 739 people, or 18.47%, compared to the first 6 months of 2022) [2]. This total number of occupational accidents occurred both in the sector with labor relations and in the sector where employees work without labor contracts. In detail, the number of fatal occupational accidents was 345, down 21 cases compared to the first 6 months of 2022; the number of people who died from occupational accidents was 353, and 784 people were seriously injured.

Despite a decline in these statistics compared to the previous year, the high number of accidents and fatalities remains a significant concern. This alarming trend underscores the need to reassess and improve labor safety measures at construction sites in the country.

Personal Protective Equipment (PPE) such as helmets, vests, gloves, and boots can significantly reduce workplace injuries and fatalities [3]. Regulatory bodies mandate PPE use in hazardous environments to mitigate risks [4]. While employers bear the responsibility to enforce PPE compliance, employees often neglect it due to inadequate safety knowledge, discomfort, or perceived hindrance [5].

Although the government has dedicated significant resources to educating workers on the importance of utilizing appropriate PPE, monitoring large groups of workers for compliance with PPE regulations can be both costly and labor-intensive from a practical standpoint [6].

How can Deep Neural Networks (DNNs) be applied to solve this problem?

Several studies have applied computer vision through surveillance cameras to detect and warn people in dangerous environments who are not wearing protective equipment. With the development of deep neural network models, some studies have actually yielded promising results [7].

However, these previous studies are restricted in detecting small objects that are far from the surveillance cameras. In addition, the datasets utilized in these studies remain relatively limited and fail to encompass several varieties of protective equipment that are commonly employed on real construction sites.

3. Object and Scope of the Study

This research aimed to develop a system that automatically monitors and warns workers when they lack PPE on dangerous construction sites, in order to limit unwanted accidents in these places. Meanwhile, it will also assist authorities in identifying and improving less safe construction sites.

This study was conducted with the following three main objectives:

(1) A rich, high-quality dataset of about 4,000 images of workers wearing personal protective equipment at construction sites, captured from many angles, distances, and positions.

(2) Evaluation of different types and versions of deep neural network models for the detection of PPE compliance in real construction sites.

(3) Deployment of a real-time automatic surveillance system for tracking and managing the PPE compliance of workers.

Roboflow is the tool used to annotate labels for objects. Preprocessing (data cleaning) methods are used to obtain a standard dataset that meets model development requirements. Additionally, modern machine learning and deep learning theories are considered to improve the models' performance. The Python language and the OpenCV and PyTorch libraries are used to program and build the various neural network models. Experimental methods and quantitative evaluation based on metrics such as precision, recall, and F1-score are also used to evaluate the effectiveness of the models.
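To make the evaluation step concrete, precision, recall, and F1-score can be computed directly from counts of true positives (TP), false positives (FP), and false negatives (FN). The sketch below uses made-up counts for illustration, not results from this study:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: 80 correct detections, 20 false alarms, 10 missed objects
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.889 0.842
```

F1 balances the two error types, which matters for PPE detection: a model that never raises alarms has perfect precision but useless recall.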

The content of the research topic is structured into chapters to solve the problems mentioned in the research objectives. Accordingly, the study includes an overview of DNN models, the object detection problem, the YOLO architecture, and the interpretation of results. The main body of the thesis consists of four chapters:

Chapter 1: Overview of Deep Neural Network Models

Chapter 2: Object Detection and YOLO Architecture

Chapter 3: PPE Detection Based on YOLO Models

Chapter 4: Results

In the construction realm, real-time workwear detection has surged in significance. Monitoring staff attire compliance with personal protective equipment (PPE) in real time enhances workplace safety and accident prevention. This literature review investigates the burgeoning body of research dedicated to this crucial area.

There is previous research focusing on the development of real-time computer vision systems for the detection of safety gear worn by construction workers. The objective of this research is to improve the safety of workers by ensuring that they are wearing the appropriate safety gear, such as hard hats and safety vests. The proposed systems use a deep learning-based approach for the detection of safety gear [8]. Specifically, the YOLO algorithm is used for object detection, which allows for real-time processing of video data, and the system is trained on a dataset of images of construction workers wearing various types of safety gear. One of these studies achieved an overall detection accuracy of 83.77% for persons, 88.51% for hard hats, and 81.47% for safety vests. The system was also able to operate in real time, taking 36.62 ms to process an image.

However, the authors of this paper acknowledge potential limitations of the system, such as the need for appropriate lighting conditions and the possibility of false positives or false negatives. Furthermore, images in recent research are taken individually from a variety of devices, locations, times, perspectives, and projects, which makes testing images more challenging. We address this problem by collecting images taken from many different angles, with different gestures, in the same as well as different places and times. In addition, previous studies used older model versions, leading to unoptimized results and performance; with YOLOv3, for example, even a sufficiently trained model may find it difficult to detect smaller PPE components [9]. The number of detectable classes was also limited, and some classes did not reach high accuracy (over 85%), motivating the development of optimized algorithms to improve training performance. We therefore train models with high recognition ability and good performance, in a short time, using the latest up-to-date versions of YOLO such as YOLOv8 and YOLOv9.

In this research, we not only focus on training datasets with many different models but also apply combinations and develop optimized algorithms such as SAHI to improve training efficiency. In addition, we developed a system that not only identifies labor protection equipment well but also detects and warns of dangerous areas on construction sites. In particular, we built a website for practical application with a realistic user experience: the system receives images uploaded by users, processes them, and displays reliable results on the screen.

Overall, previous papers on PPE detection present applications of computer vision techniques for improving safety in the construction industry. The results are promising and suggest that real-time detection of safety gear is a feasible approach that could potentially save lives and reduce the risk of accidents in the workplace, but there are still some limitations in these systems that need to be improved.

OVERVIEW OF DEEP NEURAL NETWORK MODELS

1. Deep Neural Network

A Deep Neural Network (DNN) is a type of Artificial Neural Network that consists of multiple layers of interconnected processing nodes, called neurons [10]. These layers enable the network to learn hierarchical representations of data, making it particularly effective for a wide range of tasks in fields such as computer vision, natural language processing, and speech recognition.

In a DNN, the input layer receives raw data, such as an image or audio waveform. The data is then passed through a series of hidden layers, each of which applies a set of mathematical operations to transform the data in some way. The final layer produces an output, which may be a prediction or classification of the input data.

DNNs are typically trained using a process called backpropagation, in which the network's weights are adjusted to minimize a loss function that measures the difference between the network's predictions and the true labels of the training data. This process typically requires a large dataset and powerful computing resources, such as GPUs.
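As an illustration of the training loop described above, the following sketch fits a single linear layer by gradient descent on a mean-squared-error loss. It is a toy stand-in for backpropagation through a deep network; the data and learning rate are invented for the example:

```python
import numpy as np

# Toy "network": one linear layer trained with gradient descent on MSE loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])    # weights we hope to recover
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)                        # network weights, initialized to zero
lr = 0.1                               # learning rate
losses = []
for _ in range(50):
    pred = X @ w                       # forward pass
    err = pred - y
    loss = np.mean(err ** 2)           # MSE loss
    grad = 2 * X.T @ err / len(y)      # backward pass: gradient dL/dw
    w -= lr * grad                     # weight update
    losses.append(loss)

print(losses[0] > losses[-1])  # True: the loss decreases over training
```

In a real DNN the backward pass chains this gradient computation through every layer, but the update rule per weight is the same idea.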

2. Convolutional Neural Network

A Convolutional Neural Network (CNN) is a type of DNN that is primarily used in image and video recognition and classification tasks [11]. It is a feedforward neural network that takes input in the form of an image and transforms it through several layers of convolution, pooling, and fully connected layers, leading to a prediction or classification output.

The CNN architecture is inspired by the organization of the animal visual cortex. The neurons in the visual cortex are arranged in a hierarchical fashion, where neurons at lower levels respond to basic features like edges, lines, and curves, and neurons at higher levels respond to more complex patterns and objects. Similarly, in a CNN, the early layers detect simple features like edges and lines, while the deeper layers identify more complex patterns and shapes. The basic components of a CNN include convolutional layers, pooling layers, and fully connected layers.

Convolutional layers are the core building blocks of a CNN. They apply a set of learnable filters to the input image, performing a convolution operation that extracts features from the image. Each filter extracts a different feature, such as edges, lines, or shapes. The output of a convolutional layer is a set of feature maps that highlight the presence of these features in the input image. Typically, multiple convolutional layers are stacked one after the other, allowing the network to learn more complex features [5].

[Figure: CNN architecture diagram with stacked convolutional layers, max pooling, and fully connected + ReLU layers (fc6, fc7, fc8: 1x1x4096, 1x1x4096, 1x1x1000)]

Pooling layers are used to reduce the spatial size of the feature maps produced by the convolutional layers. They do this by taking the maximum, average, or sum of a small region of the feature map, known as the pooling window. This operation reduces the number of parameters in the network and makes it more robust to small translations and distortions in the input image.
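The convolution and max-pooling operations described above can be sketched in a few lines of NumPy. The image and kernel values here are invented for illustration; in a real CNN the kernel weights are learned during training:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling over size x size windows."""
    h, w = fmap.shape
    return fmap[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.array([[1, 2, 0, 1, 3],
                  [0, 1, 3, 2, 1],
                  [2, 0, 1, 0, 2],
                  [1, 3, 0, 2, 0],
                  [0, 2, 1, 1, 1]], dtype=float)
edge_kernel = np.array([[1, -1],
                        [1, -1]], dtype=float)  # responds to vertical edges

fmap = conv2d(image, edge_kernel)   # 4x4 feature map
pooled = max_pool2d(fmap)           # 2x2 after 2x2 max pooling
print(pooled.tolist())  # [[1.0, 2.0], [4.0, 2.0]]
```

Note how pooling halves each spatial dimension while keeping the strongest filter responses, which is exactly the parameter reduction described above.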

The output of the final pooling layer is flattened into a vector and passed through a series of fully connected layers. These layers are similar to those in a standard feedforward neural network, where each neuron in a layer is connected to every neuron in the next layer. The fully connected layers use the features learned by the convolutional and pooling layers to make a prediction or classification of the input image.

In addition to these basic components, CNNs may also include other layers such as activation layers, dropout layers, and normalization layers. Activation layers apply a non-linear function to the output of the previous layer, introducing non-linearity into the network. Dropout layers randomly drop out some neurons during training, reducing the risk of overfitting. Normalization layers normalize the output of a layer, making it more robust to changes in the input distribution.

3. Algorithms for Improving CNNs

Improving the quality of CNNs is essential to achieving better performance in computer vision tasks. Dropout, Batch Normalization, Data Augmentation, Transfer Learning, and Learning Rate Scheduling are techniques that can be used to enhance the accuracy of CNNs.

3.1. Dropout

Dropout was proposed by Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, et al. in 2012. It is a simple yet effective technique that helps to prevent overfitting and improve the generalization performance of the network [12].

The basic idea behind dropout is to randomly drop out a certain percentage of the neurons in a layer during training, by setting their outputs to zero. In the inverted-dropout formulation, the remaining neurons are scaled by a factor of 1/(1-p) during training, where p is the dropout probability, so that no rescaling is needed at test time. In the original formulation, no scaling is applied during training; instead, all neurons are used at test time with their outputs multiplied by (1-p) to compensate for the dropout during training.

Dropout works by forcing the network to learn redundant representations. If a neuron is dropped out during training, the remaining neurons have to learn to compensate for its absence. This encourages the network to learn more robust and diverse representations, which in turn helps to prevent overfitting and improve the generalization performance of the network.

Dropout can be applied to one or more fully connected layers in a CNN. The dropout probability is usually set to a value between 0.1 and 0.5, depending on the size of the network and the complexity of the task. Dropout can also be combined with other regularization techniques, such as L1 or L2 regularization, to further improve the performance of the network.
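A minimal sketch of inverted dropout, the formulation in which the surviving activations are rescaled during training. The input array and dropout probability are illustrative:

```python
import numpy as np

def dropout(x, p, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training
    and rescale survivors by 1/(1-p), so no scaling is needed at test time."""
    if not training or p == 0.0:
        return x
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(42)
x = np.ones(10000)
y = dropout(x, p=0.5, rng=rng)
print(sorted(np.unique(y).tolist()))  # [0.0, 2.0]: survivors are rescaled
print(abs(y.mean() - 1.0) < 0.1)      # True: expected activation is preserved
```

The rescaling is what keeps the expected activation the same with and without dropout, so the layer behaves consistently between training and inference.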

3.2. Batch Normalization

Batch normalization was introduced by Sergey Ioffe and Christian Szegedy in 2015 and has since become a standard technique in deep learning [13].

Batch normalization normalizes layer activations across mini-batches during training, mitigating internal covariate shift. Internal covariate shift refers to the change in activation distribution within each layer, which can hinder training stability and convergence. By reducing this shift, batch normalization promotes smoother training, preventing overfitting and enabling better model performance.

Batch normalization works by computing the mean and standard deviation of the activations of a layer across the mini-batch of samples. These statistics are then used to normalize the activations using the following formula: y = (x - mean) / sqrt(var + eps), where x is the input to the layer, mean and var are the mean and variance of the activations across the mini-batch, and eps is a small constant added for numerical stability. The normalized output y is then scaled by a learnable parameter gamma and shifted by a learnable parameter beta.
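The formula above can be implemented directly. In this sketch gamma and beta are left at their identity values; in a trained network they are learned parameters:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization over the mini-batch axis:
    y = gamma * (x - mean) / sqrt(var + eps) + beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])   # two features on very different scales
out = batch_norm(batch)
print(np.abs(out.mean(axis=0)).max() < 1e-6)  # True: per-feature mean ~ 0
print(out.std(axis=0).round(3).tolist())      # [1.0, 1.0]: unit variance
```

After normalization both features live on the same scale, which is why the subsequent layers train more stably.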

Batch normalization can be applied at any stage of a convolutional neural network, to both convolutional and fully connected layers. It can be placed before or after the activation function, with the former being the more prevalent approach.

Batch normalization has several benefits for deep neural networks. First, it reduces the internal covariate shift, making the training process more stable and faster. Second, it acts as a regularizer, reducing the risk of overfitting. Third, it helps to improve the generalization performance of the network by reducing its dependence on specific features of the input distribution.

3.3. Data Augmentation

Data augmentation is a technique used to increase the size of a training dataset by artificially generating new samples. It helps improve the performance of the model by reducing overfitting and improving generalization [14].
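A minimal sketch of data augmentation using a random horizontal flip and brightness jitter. The image and jitter range are invented; real pipelines typically use libraries such as torchvision or Albumentations, and detection datasets also need the bounding boxes transformed alongside the pixels:

```python
import numpy as np

def augment(image, rng):
    """Produce a randomly flipped and brightness-jittered copy of an image,
    a minimal stand-in for the augmentations used when training detectors."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                                   # horizontal flip
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 255.0)   # brightness jitter
    return out

rng = np.random.default_rng(0)
image = np.arange(12, dtype=float).reshape(3, 4)   # tiny fake image
augmented = [augment(image, rng) for _ in range(4)]  # 4 new training samples
print(len(augmented), augmented[0].shape)  # 4 (3, 4)
```

Each call yields a slightly different image, so one labeled photo effectively becomes many training samples.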

OBJECT DETECTION AND YOLO ARCHITECTURE

1.2. Object Detection

Object detection is a computer vision problem that involves identifying objects of interest within digital images or video frames. The goal of object detection is not only to recognize the presence of objects in an image but also to determine their location and extent within the image [18].

Object detection is an important task for a wide range of applications, including self-driving cars, robotics, surveillance, and security systems. The process of object detection involves the following steps:

+ Object proposal generation: Generate a set of potential object locations in the image using a technique such as sliding window or region proposal.

+ Feature extraction: Extract features from each proposed object region using techniques such as CNNs.

+ Object classification: Classify each proposed object region as containing an object or not using a binary classifier such as a support vector machine (SVM) or logistic regression.

+ Object localization: If the proposed object region contains an object, use regression to predict the object's bounding box coordinates within the image. These algorithms use CNNs to extract features from proposed object regions and perform object classification and localization.
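The proposal-generation step above can be sketched with a simple sliding window. The window size and stride are arbitrary choices for the example; modern detectors such as YOLO replace this enumeration with a learned, grid-based prediction, but the idea of scanning candidate regions is the same:

```python
def sliding_windows(img_w, img_h, win=64, stride=32):
    """Enumerate candidate object regions (x, y, w, h) with a sliding window,
    the simplest form of object proposal generation."""
    boxes = []
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            boxes.append((x, y, win, win))
    return boxes

proposals = sliding_windows(256, 128, win=64, stride=32)
print(len(proposals))  # 21: a 7 x 3 grid of candidate windows
```

Each proposal would then be passed to the feature-extraction and classification steps listed above.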

1.3. Segmentation

Image segmentation, a crucial aspect of computer vision, entails partitioning an image into distinct segments based on its visual content Unlike object detection, which focuses on identifying whole objects, image segmentation delves deeper into identifying and labeling individual pixels or pixel clusters within an image, providing a more fine-grained analysis of the image's content.

There are different types of image segmentation techniques, including thresholding, edge detection, clustering, and deep learning-based methods. Deep learning-based methods have achieved state-of-the-art performance on image segmentation tasks.

Image segmentation has many applications, including medical imaging, video surveillance, and autonomous vehicles. The process of image segmentation involves the following steps:

+ Preprocessing: Preprocess the image by resizing it to a standard size and normalizing the pixel values.

+ Feature extraction: Extract features from the image using techniques such as CNNs.

+ Pixel grouping: Group pixels based on their similarity in color, texture, or other visual features.

+ Segmentation: Assign labels to each pixel or group of pixels based on their visual characteristics.
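The simplest instance of these steps is global thresholding, which assigns each pixel a foreground or background label based on intensity. The image and threshold below are invented for illustration:

```python
import numpy as np

def threshold_segment(image, thresh):
    """Simplest segmentation: label each pixel foreground (1) or background (0)."""
    return (image > thresh).astype(np.uint8)

image = np.array([[ 10,  50, 200],
                  [220,  30, 180],
                  [ 40, 210,  20]])   # tiny fake grayscale image
mask = threshold_segment(image, thresh=128)
print(mask.tolist())  # [[0, 0, 1], [1, 0, 1], [0, 1, 0]]
```

Deep learning-based methods replace this fixed rule with a learned per-pixel classifier, but the output has the same shape: one label per pixel.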

RESULTS

1. Metrics for Performance Evaluation

2. Quantitative Analysis

3. Qualitative Analysis

The detection results of the four best models are presented in Figure 7. The left column shows a man examining protective gloves at a near distance, with close-range objects.

All models detect these objects easily with high confidence (more than 80%), but the YOLOv8 models show lower confidence than the 9th generation.

While both generations make similar predictions overall, and the barriers yield similar results, with objects like gloves the 9th-generation YOLO model sometimes made wrong predictions, while the 8th generation detected them more accurately. However, when detecting the remaining objects, such as people, hard hats, and protective gear, the YOLOv9-C model consistently achieves significantly higher confidence than its predecessor.

4. Enhancing Model Efficiency with SAHI

SAHI provides some crucial advantages for PPE detection on real construction sites. Firstly, the model can detect and track small items in crowded areas, which is common in public surveillance and security management systems. Secondly, SAHI ensures high quality through fault identification and quality control in industrial operations; this is especially important for PPE detection because surveillance camera views inside construction zones are easily crowded with movement and noise. Finally, it provides robust object manipulation in automated tasks that require accurate identification of small items.

The improvement in the efficiency of YOLO models when using SAHI is demonstrated in the two comparison photos. In the original photo, only one boot is detected, due to the small size and far distance of this object. After applying SAHI, more boots are recognized by the model.
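The core idea behind SAHI's sliced inference is to tile the image into overlapping slices so that small, distant objects occupy a larger fraction of each detector input. A sketch of the slicing geometry is below; the slice size and overlap mirror common defaults, and this is an illustration of the technique, not the SAHI library API. In the full pipeline, the detector runs on each slice and the resulting boxes are shifted back by the slice offset before being merged (e.g. with non-maximum suppression):

```python
def slice_image(img_w, img_h, slice_size=640, overlap=0.2):
    """Compute overlapping slice windows in the style of sliced inference.
    Returns (x1, y1, x2, y2) windows covering the whole image."""
    step = int(slice_size * (1 - overlap))
    xs = list(range(0, max(img_w - slice_size, 0) + 1, step)) or [0]
    ys = list(range(0, max(img_h - slice_size, 0) + 1, step)) or [0]
    # Make sure the right and bottom edges are fully covered
    if xs[-1] + slice_size < img_w:
        xs.append(img_w - slice_size)
    if ys[-1] + slice_size < img_h:
        ys.append(img_h - slice_size)
    return [(x, y, x + slice_size, y + slice_size) for y in ys for x in xs]

slices = slice_image(1920, 1080, slice_size=640, overlap=0.2)
print(len(slices))  # 8 slices cover a Full HD frame
```

A distant boot that spans only a dozen pixels in the full 1920x1080 frame spans proportionally far more of a 640x640 slice, which is why the sliced model recovers detections the full-frame model misses.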

5 Developing Automatic Surveillance System using Streamlit

Streamlit is a free and open-source framework for quickly building and sharing machine learning and data science web applications [33]. It is a Python-based library designed specifically for machine learning engineers. Streamlit allows users to create a great-looking app with just a few lines of code.


Figure 12. The home page of the automatic surveillance system

Using that library and the results of the training process, we deployed an automatic surveillance system that monitors compliance with worker personal protective equipment (PPE) regulations on construction sites. The goal of the website is to detect workers working with or without PPE in real time, which can assist the manager in evaluating the status of workers' compliance and handling violations immediately.

a. Machine Learning Model Config

The ML model configuration consists of two primary sections: "Select Model" and "Select Model Confidence."

Users can choose between monitoring PPE compliance or worker location under "Select Model." Additionally, they can customize the confidence level from 50 to 100 under "Select Model Confidence" to align with their specific requirements.

Figure 13. Example of the Image Detection Function

The image/video configuration interface features a "Select Source" section with four options: Image, Video, Webcam, and RTSP.

When selecting the "Image" option, users are prompted to choose an image file, with a box displaying the request "Limit 200MB per file, JPG, JPEG, PNG, BMP, WEBP" and allowing drag-and-drop. Upon uploading the image, the system automatically identifies objects within the image based on the available classes, displaying the recognized objects while indicating processing with the word "running" in the corner of the screen.

In the "Video" option, users are prompted to choose a video file, with a drag-and-drop box displaying the request "Limit 200MB per file: MP4, AVI, MOV, MPEG4," akin to the image upload functionality. Once the video is displayed on the screen, users can initiate object detection by clicking the "detect video objects" button. The system will generate identification results, indicated by the cessation of the "running" status displayed in the corner of the screen.

Additionally, the main screen presents two significant components: "Display Tracker" and "Tracker." Under "Display Tracker," users are given the option to enable or disable display tracking, represented by yes/no choices.

Figure 14. Example of the Webcam Detection Function

In the "Tracker" section, two options are included: "bytetrack.yaml" and "botsort.yaml." ByteTrack uses a straightforward and efficient association method that tracks by linking every detected box, using similarity to existing tracks to recover true objects and filter out background detections among low-score detection boxes [34], while BoT-SORT introduces a robust, advanced tracker that combines motion and appearance information, integrates camera motion compensation, and employs more precise Kalman filter state vectors for enhanced tracking accuracy [35].

In the "Webcam" option, users can initiate dynamic recognition directly through their camera by clicking the "Detect Objects" button. Upon detection, if the system identifies a person not fully equipped with personal protective equipment (PPE), an audio warning is issued.

In the "RTSP" option, the main screen displays an input field where users can enter the IP address to connect to another device's camera through an intermediary application named DroidCam. Results are displayed on the main screen, and, as with the Webcam option, the system will issue an audio warning when it detects any violations.

c. Worker Tracking

Tracking the position of workers on construction sites is crucial for ensuring safety, efficiency, accountability, productivity, and compliance. By knowing where workers are located, supervisors can quickly respond to emergencies, optimize resource allocation, maintain accountability, analyze productivity levels, and ensure adherence to regulations. This information not only enhances the overall safety of the site but also improves the overall efficiency of site operations.

Figure 15. View of construction sites from a surveillance camera

When users interact with the Worker model, they gain the ability to determine the location and movements of workers by utilizing images captured from cameras.

Center Coordinates of Detected Objects

Figure 16. Coordinates of workers from the surveillance camera

By leveraging computer vision technology, these images are processed to track the position of workers in real-time This capability empowers users to monitor worker activity remotely, ensuring safety protocols are followed, tasks are efficiently assigned, and potential hazards are swiftly addressed Moreover, it facilitates data-driven decision-making by providing insights into worker productivity and site operations. Overall, integrating camera-based image analysis with the Worker model enhances site management and contributes to a safer and more efficient construction environment.
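The center coordinates shown on this screen can be derived directly from each detection's bounding box. The sketch below assumes boxes in (x1, y1, x2, y2) pixel format, with invented detections for illustration:

```python
def box_centers(boxes):
    """Convert (x1, y1, x2, y2) detection boxes into center coordinates,
    as plotted on the worker-tracking view."""
    return [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]

detections = [(100, 200, 180, 420),   # hypothetical worker 1
              (640, 310, 700, 500)]   # hypothetical worker 2
print(box_centers(detections))  # [(140.0, 310.0), (670.0, 405.0)]
```

Tracking these centers frame over frame is what lets the system follow each worker's movement across the site.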

In summary, our study provides a comprehensive examination of object detection models for Personal Protective Equipment (PPE) detection in construction sites.

It underscores the superiority of the YOLOv9c model, which exhibits higher Average Precision (AP) scores across critical classes such as Person, Hardhat, and Vest. This signifies the model's ability to accurately identify and localize PPE-related objects, thus enhancing workplace safety and regulatory compliance. Besides that, the qualitative analysis sheds light on the nuanced performance differences between the YOLOv8 and YOLOv9 models. While YOLOv9 demonstrates higher confidence in detecting certain objects, there are instances where it may exhibit lower accuracy compared to YOLOv8. This highlights the ongoing need for refinement and optimization in model development to address specific challenges in PPE detection effectively.

Integrating SAHI enhances object detection, particularly in identifying small and rare samples. This improves the model's efficiency and accuracy, boosting the overall effectiveness of the system. By strengthening object detection capabilities, SAHI contributes to enhanced workplace safety measures.

Posted: 08/10/2024, 01:04
