Recognition and segmentation of road surface damage using deep learning technology
Introduction
Context
Surface damage on roads, such as cracks, potholes, and deformations, significantly impacts traffic safety and the longevity of infrastructure. Prompt detection and repair of these issues are crucial for minimizing accident risks and lowering long-term maintenance expenses. Unfortunately, conventional methods that depend on manual inspections or physical sensors tend to be time-consuming, expensive, and often lack precision.
The rise of deep learning technology has revolutionized the automation of road surface damage detection and segmentation. Object detection models like YOLO (You Only Look Once) stand out for their impressive speed and accuracy. The newest iteration, YOLOv11n, is specifically designed for low-resource devices, ensuring rapid and precise detection, which enhances its practicality for real-world applications.
This study investigates the application of YOLOv11n for detecting and segmenting road surface damage, aiming to enhance detection accuracy and assess the potential for integrating this technology into practical uses like automated road monitoring and infrastructure maintenance systems.
Problem Statement
The swift advancement of transportation has led to significant road surface damage, posing risks to both safety and efficiency within the transportation system. Currently, the detection and segmentation of road damage predominantly depend on manual techniques or outdated monitoring systems, which are both time-consuming and expensive. Consequently, implementing automated image recognition technology emerges as a practical solution to enhance the efficiency and precision of road damage detection.
The thesis "Road Damage Detection and Segmentation Using Deep Learning Technology (YOLOv11n)" aims to develop an automated system leveraging the advanced YOLOv11n model to effectively detect and segment damaged road surfaces. This approach seeks to minimize the need for manual inspections while enhancing the accuracy of identifying and classifying different types of road damage, including cracks, potholes, subsidence, and other defects.
This solution not only saves time and costs for transportation authorities but also enhances road safety by quickly identifying and addressing surface damage issues.
Research question
1. How can the YOLOv11n model be optimized for detecting and segmenting road surface damage?
2. What is the accuracy and effectiveness of the YOLOv11n model in identifying and classifying road damage types such as cracks, potholes, and subsidence?
In addition, the study addresses the following sub-questions:
1. How should the image data be preprocessed to optimize detection results?
2. How do the YOLOv11n parameters (such as IoU threshold, confidence score) affect the accuracy of the results?
3. Can the model perform well under different conditions (e.g., bad weather, low light, or varying camera angles)?
4. How does this system compare to current methods (such as manual inspection or traditional models) in terms of time and cost efficiency?
5. How can the system be deployed as a web application capable of processing and displaying real-time results?
Objectives
Develop an automated system for road damage detection and segmentation using deep learning technology (YOLOv11n) to accurately identify and classify road surface defects, improving efficiency and reducing manual inspection.
- Implement and optimize the YOLOv11n model for detecting and classifying various types of road surface damage, such as cracks, potholes, and subsidence.
- Collect and preprocess a comprehensive dataset of road damage images to train the deep learning model.
- Evaluate the performance of the YOLOv11n model in detecting road surface damage on real-world datasets, comparing it to existing methods.
- Develop a web application capable of processing and displaying results for real-time road damage detection and segmentation.
Scope and Limitations
This project aims to create an automated system for detecting and segmenting road damage using deep learning, specifically the YOLOv11n model. It will identify various types of road defects, including cracks, potholes, and subsidence. Key components of the project involve constructing a dataset of road damage images, training and optimizing the YOLOv11n model, assessing its performance with real-world data, and designing a user interface. However, the project will not extend to developing a full road monitoring system or integrating the model with existing traffic management solutions.
Limitations: The project has the following limitations:
- The training dataset may lack comprehensive representation of all real-world road damage types due to limitations in both the quantity and quality of images included.
- The system focuses only on detecting and classifying basic road damages and cannot handle more complex issues such as deep structural road damage or environmental factors.
- The project does not include large-scale deployment or integration into existing traffic monitoring systems.
Significance of the Study
This study enhances road surface damage detection and management methods by enabling timely and precise identification of defects like cracks, potholes, and subsidence. By utilizing deep learning technology (YOLOv11n), the automated system significantly improves the accuracy and efficiency of road damage detection. This advancement aids transportation authorities in optimizing maintenance efforts and resource allocation, ultimately reducing accidents and lowering maintenance costs while improving infrastructure quality.
This research enhances computer vision techniques for infrastructure management and promotes the evolution of smart traffic systems, ultimately leading to improved road safety and lower infrastructure maintenance costs.
Literature Review
OVERVIEW OF AI, DEEP LEARNING
2.1.1 Artificial Intelligence (AI) and its applications in society.
2.1.1.1 The concept of artificial intelligence.
Artificial Intelligence (AI) is a branch of computer science focused on developing machines that can learn, understand, and act independently to solve complex problems traditionally handled by humans. By enabling computers to perform intelligent tasks, such as image recognition and autonomous driving, AI has the potential to match or exceed human capabilities.
The main fields of AI include:
Machine Learning (ML): Machine learning is a part of AI that involves computers learning from the data provided to them without being explicitly programmed.
Computer Vision: Computer vision focuses on enabling computers to recognize and understand images and videos.
Natural Language Processing (NLP): Natural language processing involves a computer's ability to understand, interpret, and generate human language automatically.
Robotics: This field focuses on creating and controlling robots that can learn and perform tasks as required.
AI has a wide range of applications in society and daily life, from personal applications to various fields such as industry and public services:
Industry and Manufacturing: In the industrial sector, AI is used to optimize production processes, forecast market demand, and enhance machine and robot operation efficiency.
Healthcare: AI is utilized in medical image analysis, disease diagnosis, predicting complications, and drug development.
Finance and Banking: AI is applied in automated financial transactions, market data analysis, risk assessment, and credit risk reduction.
Transportation: In transportation, AI is used for smart traffic control, predicting traffic conditions, and developing self-driving vehicles.
Customer Service: AI is used in chatbots for automatic responses, online customer support, and automated feedback.
Education: In education, AI can create personalized learning models, provide instant feedback to students and teachers, and assist in analyzing learning performance.
Figure 2.2: Artificial intelligence applied in everyday life
These applications are just a small part of the many ways in which artificial intelligence is being used to improve life and bring significant advancements in various sectors of society.
Figure 2.3: Overview of Deep Learning
Deep Learning is a branch of Machine Learning (ML) focused on using deep neural networks to learn and understand complex data. The main characteristics of deep learning include:
Deep Neural Networks: Deep learning leverages neural networks with many layers, sometimes dozens or even hundreds, to effectively learn and extract features from data.
Figure 2.4: Neural Networks in Deep Learning
Automation: Deep learning allows the automation of the learning process from data. Deep neural networks can learn the complex structures and features of data without human intervention.
Versatility: Deep learning can be applied to various types of data, including images, text, audio, and sequential data.
High Performance: Due to the large number of neural layers and the ability to learn from large datasets, deep learning can achieve high performance in prediction and classification tasks.
High Complexity: Deep learning models can understand and process complex, unstructured data patterns.
High Resource Requirements: Due to the large number of layers and parameters, training deep learning models requires substantial computational resources and time.
2.1.2.2 Significant Contributions of Deep Learning in Artificial Intelligence
Deep learning has greatly advanced Artificial Intelligence (AI), enabling innovative applications and enhancing the effectiveness of conventional AI tasks. Its key contributions include improved data processing capabilities, increased accuracy in predictions, and the ability to learn complex patterns from large datasets.
● Computer Vision: Deep learning has significantly enhanced object detection, facial recognition, and image classification capabilities, making it essential for various applications. Key uses include self-driving cars, traffic sign recognition, and medical image classification, all of which depend on advanced deep learning models for improved performance.
● Natural Language Processing: In the field of Natural Language Processing (NLP), deep learning has enhanced the performance of machine translation, text summarization, and sentiment analysis models. Virtual assistants like Siri, Alexa, and Google Assistant also use deep learning techniques.
● Automation of Tasks: Deep learning has helped create automated systems in various industrial sectors, ranging from autonomous driving to process automation and market demand forecasting.
● Forecasting and Prediction: Deep learning models have significantly improved forecasting abilities in many areas, including weather forecasting and financial predictions.
● Healthcare and Pharmacology: In healthcare, deep learning has been used for classifying MRI images, predicting diseases based on clinical data, and discovering new pharmaceuticals.
● Automation and Robotics: Deep learning has also enhanced the performance and automation capabilities of robots in applications like vacuum cleaning robots, medical robots, and autonomous delivery robots.
Deep learning significantly enhances the capabilities of artificial intelligence, unlocking numerous opportunities for its integration into daily life and expanding its applications across various sectors.
Overview of YOLO
2.2.1 Introduction to the YOLO Model
YOLO (You Only Look Once) is a renowned model for object detection in both images and videos, celebrated for its remarkable speed and precision in identifying objects efficiently.
Figure 2.5: Timeline of major YOLO versions
Methods for Object Detection and Classification in Images
- Grid: YOLO divides the image into a grid of square cells. Each cell predicts a certain number of bounding boxes and the class probabilities for the objects within that cell.
- Bounding Boxes: Each predicted bounding box contains an object and is defined by its center coordinates, width, and height, from which the top-left and bottom-right corners of the box can be derived.
- Class Probabilities: YOLO predicts the probability of each object class for every bounding box it identifies. For instance, when detecting cars, it estimates the likelihood of the "car" class appearing within that bounding box.
- Regression: YOLO uses a regression model to predict the size and location of the bounding box.
- Non-max Suppression: Non-max suppression is a crucial step in the YOLO algorithm, where it eliminates redundant bounding boxes after the neural network generates predictions. This process ensures that only the bounding box with the highest class probability is retained, enhancing the accuracy of object detection.
The difference between YOLO and traditional methods
Speed: YOLO is renowned for its speed, significantly outpacing traditional methods by processing each image with a single pass of the neural network, unlike other techniques that necessitate multiple runs across various regions of the image.
Global Context: YOLO uses the entire image to make predictions, which helps it detect smaller and more distant objects compared to traditional methods.
End-to-End Processing: The YOLO model operates in an end-to-end manner, eliminating the need for intermediate stages such as feature extraction commonly found in traditional methods. This approach streamlines both the training and deployment processes, enhancing efficiency.
Quick Object Classification: Since YOLO predicts all the bounding boxes and object classes simultaneously across the entire image, it can classify objects quickly and accurately.
The YOLO model employs a sequence of convolutional layers to effectively extract features from input images, with the earlier (shallower) layers capturing general characteristics like edges and textures, while the deeper layers capture more abstract, class-specific details.
In YOLO, the detection layers play a crucial role in predicting bounding boxes and class probabilities for objects within each grid cell. Each layer corresponds to a specific section of the image, generating multiple bounding boxes along with the likelihood of various object classes.
Bounding box prediction involves determining the coordinates and dimensions of an object within an image. Each bounding box is represented by a vector consisting of five components: (x, y, w, h, confidence). Here, (x, y) denotes the coordinates of the box center, while (w, h) indicate the width and height of the box. The confidence score reflects the likelihood that the bounding box actually contains an object.
In class prediction, each grid cell forecasts the likelihood of various object classes. For instance, in a task focused on detecting cars and people, every cell will estimate the probabilities for both the "car" and "person" categories.
The loss layers are used to calculate the loss of the YOLO model. YOLO uses a combination of three types of loss:
● Localization Loss: Measures the discrepancy between the predicted and actual coordinates and size of the bounding box.
● Confidence Loss: Measures the discrepancy between the predicted and actual probabilities of a bounding box containing an object.
● Classification Loss: Measures the discrepancy between the predicted and actual probabilities of the object classes.
The combination of these three types of loss is used to adjust the weights of the YOLO model during the training process.
Divide the image into a grid of square cells
Prior to initiating the prediction process, the input image is segmented into a grid of square cells, with each cell associated with prediction data that includes bounding boxes and the class probabilities of the objects contained within that cell.
Predicting bounding boxes and class probabilities of objects
After the image is divided into a grid of square cells, each cell will predict a number of bounding boxes and class probabilities of objects.
● Bounding Boxes Prediction: Each bounding box is predicted by a vector with 5 components: (x, y, w, h, confidence), where:
○ (x, y) represents the coordinates of the center of the bounding box.
○ (w, h) are the width and height of the bounding box.
○ Confidence is the general probability that this bounding box contains an object.
In class prediction, each grid cell forecasts the probabilities of various object classes. For instance, when detecting cars and people, every cell can estimate the likelihood of the presence of both "car" and "person" classes.
After the prediction process, the model generates numerous bounding boxes with associated class probabilities. To refine these results and retain only the most precise bounding boxes, YOLO utilizes a method known as "Non-max Suppression."
Step 1: Remove bounding boxes with a confidence score below a defined threshold.
Step 2: Sort the remaining bounding boxes by the highest class probability score.
Step 3: Starting from the bounding box with the highest probability, remove any bounding box with an IoU (Intersection over Union) greater than a fixed threshold (e.g., 0.5). This ensures that overlapping bounding boxes are discarded, keeping only the best one.
Upon completion of this process, we will acquire a list of highly accurate bounding boxes along with the class probabilities of the objects contained within them, representing the final outcome of the YOLO model's prediction.
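To make the Non-max Suppression steps above concrete, the following is a minimal sketch in plain Python; the box format (x1, y1, x2, y2) and the threshold values are illustrative assumptions rather than the exact implementation inside YOLO.

```python
def iou(a, b):
    # a, b are boxes in (x1, y1, x2, y2) format; returns Intersection over Union.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def non_max_suppression(boxes, scores, conf_thresh=0.25, iou_thresh=0.5):
    # Step 1: discard boxes below the confidence threshold.
    keep = [i for i, s in enumerate(scores) if s >= conf_thresh]
    # Step 2: sort the remaining boxes by score, highest first.
    keep.sort(key=lambda i: scores[i], reverse=True)
    selected = []
    # Step 3: keep the best box, drop others that overlap it above the IoU threshold.
    while keep:
        best = keep.pop(0)
        selected.append(best)
        keep = [i for i in keep if iou(boxes[best], boxes[i]) < iou_thresh]
    return selected

boxes = [(10, 10, 60, 60), (12, 12, 58, 58), (100, 100, 150, 150)]
scores = [0.90, 0.75, 0.80]
print(non_max_suppression(boxes, scores))  # -> [0, 2]: the duplicate of box 0 is removed
```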
YOLO Versions
YOLOv1, developed by Joseph Redmon and his team, revolutionized object detection by implementing a single-step approach that divides images into grids. Each grid cell is responsible for predicting bounding boxes and object labels, which allows for rapid processing. Despite its advancements, YOLOv1 faced challenges in accurately detecting small objects or those that were closely positioned.
YOLOv2 enhanced the capabilities of YOLOv1 through the integration of Batch Normalization, Anchor Boxes, and Fine-Grained Features, leading to improved accuracy and fewer errors in small object detection. It utilized Darknet-19, a specially designed deep learning architecture, as its backbone, and offered support for variable input sizes, providing greater flexibility for various applications.
YOLOv3 enhanced its performance by integrating the Darknet-53 network for improved feature extraction and employed the Feature Pyramid Network (FPN) for multi-scale detection. This upgrade led to a notable increase in accuracy, particularly for small object detection. While YOLOv3 experienced a slight reduction in processing speed compared to YOLOv2, it successfully achieved a balance between speed and accuracy.
YOLOv4, independently developed by Alexey Bochkovskiy, introduced significant innovations like CSPDarknet53, Mosaic Augmentation, and CIoU Loss, which enhanced detection performance and training efficiency on standard GPUs. Its remarkable balance of speed, accuracy, and practical deployment capabilities led to widespread adoption in the field.
YOLOv5, developed by Ultralytics and built on the PyTorch framework, simplifies deployment and customization in deep learning. This version offers various model sizes, including YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Enhancements in speed and support for model export to ONNX, TensorRT, and CoreML make YOLOv5 ideal for mobile application integration.
YOLOv6 is tailored for industrial applications, delivering exceptional performance in detection tasks. It features an enhanced backbone design for effective feature extraction and employs reparameterization techniques to optimize inference. Furthermore, YOLOv6 is specifically optimized for high-performance hardware, including GPUs and CPUs.
YOLOv7, created by Wang et al., advanced real-time object detection by achieving high accuracy and was the first to incorporate the Extended Efficient Layer Aggregation Network (E-ELAN). This model demonstrated exceptional performance on the COCO (Common Objects in Context) dataset and is suitable for a wide range of real-time applications.
Ultralytics launched YOLOv8, at the time the most advanced version in the series, boasting an enhanced architecture that supports both segmentation and detection. This version includes optimized models that cater to varying needs for speed and accuracy, from lightweight to heavy options, and ensures seamless integration with contemporary deep learning tools.
YOLOv9 enhances object detection capabilities through innovative techniques like Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN), resulting in improved accuracy in complex real-world scenarios. This version also preserves high processing speeds, making it well-suited for real-time applications.
YOLOv10 leverages the power of CNN and Transformer technologies to enhance feature extraction and complex object detection. By integrating techniques like MixUp and CutMix, it significantly improves generalization capabilities. This model is optimized for diverse hardware platforms, achieving exceptional speed and accuracy. Its applications span various domains, including traffic monitoring, healthcare, and industrial robotics.
YOLOv11 represents the newest iteration in the YOLO model series, designed to significantly boost performance and accuracy for real-time object detection. With a range of remarkable enhancements, this version effectively addresses the intricate requirements of contemporary computer vision.
1. Architecture and Key Components
a. Backbone Architecture:
Backbone: YOLOv11 employs a novel backbone combining traditional CNN and Transformer networks. This approach efficiently extracts complex image features.
CNN: Provides spatial feature extraction, particularly for objects with clear shapes and structures.
Transformer: Enhances the backbone's ability to learn spatial relationships between objects, enabling detection of complex objects and their interactions.
b. Neck and Head:
YOLOv11 enhances multi-scale detection through the use of the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN). FPN significantly improves the detection of small objects, while PAN enhances the performance for objects of different sizes by effectively aggregating information from multiple layers.
The YOLOv11 head is specifically engineered to enhance the accuracy of bounding box predictions and object labeling, significantly reducing the occurrence of false positives and false negatives, especially in intricate situations involving overlapping objects.
Programmable Gradient Information (PGI): A new technique that improves the training process by allowing dynamic gradient adjustments, optimizing learning on heterogeneous datasets, and enhancing generalization and training efficiency.
Generalized Efficient Layer Aggregation Network (GELAN):
GELAN enhances layer aggregation efficiency, maintaining high accuracy while reducing the number of parameters. This not only cuts computation costs but also facilitates deployment on resource-constrained devices.
Quantization and Pruning: These techniques reduce model size without compromising performance. Quantization lowers numerical precision to save memory, while pruning removes unnecessary connections, speeding up inference without accuracy loss.
YOLOv11 excels in object detection, achieving the highest mean Average Precision (mAP), particularly for small and complex objects, making it superior to earlier versions and competing models.
YOLOv11 is designed for versatility, functioning seamlessly across personal computers, mobile devices, and robots. Its advanced architecture does not compromise inference speed, ensuring it remains an excellent option for real-time applications.
By leveraging FPN (Feature Pyramid Network) and PAN (Path Aggregation Network), YOLOv11 can detect objects of varying sizes, from small objects in the scene to large-scale targets.
Methodology
Proposed Deep Neural Network (DNN) structure
Figure 3.1: DNN Structure for Road Damage Detection
The road surface condition monitoring system utilizes a variety of devices, primarily vehicle-mounted cameras, to gather essential data. These cameras include RGB, infrared, and 3D sensors like LiDAR, enabling comprehensive data capture. To enhance context, environmental sensors such as GPS and accelerometers are integrated during data collection. The primary data collected consists of images and videos as vehicles traverse the road, facilitating the analysis of road surface integrity. Strategic placement of these devices ensures optimal coverage while reducing occlusions and shadows that could compromise data quality.
The original image is the unprocessed visual data captured by sensors, such as LiDAR, which can be in 2D or converted to 3D formats. These images are projected into a 2D space for analysis and often reveal imperfections in road conditions, including cracks, potholes, surface depressions, and discolorations indicative of deterioration. Due to the raw nature of this data, it may lack clarity, emphasizing the necessity for preprocessing and enhancement to improve its quality.
After acquiring the original images, they undergo an image processing module to improve their quality and ready them for the deep learning model. This phase may include various preprocessing techniques.
Proper image alignment is crucial for accurate analysis, especially when data is collected from various angles or perspectives. Correcting the orientation of images ensures that they are positioned correctly before examination.
Color space transformation is essential for effective road damage detection, as different types of damage, such as cracks and potholes, are more discernible in specific color spaces. For instance, using the HSV color space enhances the visibility of cracks, while the RGB color space is better suited for identifying potholes. By converting images into the most relevant color space, critical features are highlighted, improving the accuracy of damage detection.
Data augmentation is a technique that enhances training datasets by creating modified versions of images. By applying methods like rotation, flipping, scaling, and brightness adjustment, this approach helps models generalize better and reduces the risk of overfitting.
Denoising is essential for enhancing road images, as they frequently suffer from noise caused by environmental factors, poor lighting, or sensor limitations. By utilizing denoising algorithms, we can effectively clean the data and preserve the most significant features for better analysis and interpretation.
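As an illustration of the color-space conversion and denoising steps described above, a minimal OpenCV sketch is given below; the file name and filter parameters are assumptions chosen for demonstration, not the exact values used in this system.

```python
import cv2

# Load a raw road image (path is illustrative)
img = cv2.imread("road_frame.jpg")

# Convert to the HSV color space, in which cracks are often easier to separate
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Non-local means denoising: reduces sensor/lighting noise while keeping crack edges
# Arguments: source, destination, h, hColor, templateWindowSize, searchWindowSize
denoised = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 21)

cv2.imwrite("road_frame_denoised.jpg", denoised)
```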
After preprocessing, image data is input into a deep learning model, commonly a convolutional neural network (CNN) or the advanced YOLOv11 (You Only Look Once version 11). This model conducts image segmentation, categorizing the image into distinct regions that represent various types of road surface damage, such as cracks, potholes, and depressions. YOLOv11 excels in real-time object detection and segmentation, delivering high accuracy and speed. It not only segments the image but also classifies each segment into predefined damage categories. The results are visually represented with highlighted areas indicating the identified damages, facilitating further analysis, reporting, or intervention planning.
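A minimal sketch of running such a segmentation model through the Ultralytics API is shown below; the weight file name and image path are assumptions that illustrate the call pattern rather than the exact pipeline used in this project.

```python
from ultralytics import YOLO

# Load a YOLO11 nano segmentation model (weight file name assumed)
model = YOLO("yolo11n-seg.pt")

# Run detection and segmentation on a single road image (path is illustrative)
results = model("road_frame.jpg", conf=0.25)

for r in results:
    print(r.boxes.xyxy)                 # bounding boxes as (x1, y1, x2, y2)
    if r.masks is not None:
        print(r.masks.data.shape)       # one mask per detected damage region
    r.save(filename="road_frame_segmented.jpg")  # save an annotated output image
```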
Experimental setup
This research applies deep learning methods for road surface damage detection and segmentation using the YOLOv11n model. The experimental setup includes the following main steps:
The road damage image dataset is compiled from publicly accessible sources, featuring various damage types like cracks, potholes, and depressions. It includes images from open databases, research initiatives, and road surveillance systems, ensuring diversity in conditions such as weather, time of day, and road types. Essential preprocessing steps are undertaken to prepare this data for model training.
To optimize the YOLOv11n model, images undergo cropping and resizing, which preserves essential features of the road surface while minimizing computational complexity.
Normalization: Images are normalized to have consistent brightness and contrast to minimize lighting variations, which helps the model focus on the structural features of the road damage.
Data Augmentation: Techniques like rotation, flipping, scaling, and color adjustments are applied to increase the diversity of training data, improving the model’s generalization ability.
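The sketch below illustrates this kind of augmentation with OpenCV; the rotation, scale, and brightness ranges are illustrative assumptions, and when bounding-box labels are used the same geometric transforms must also be applied to the labels.

```python
import random
import cv2

def augment(img):
    h, w = img.shape[:2]
    # Random horizontal flip
    if random.random() < 0.5:
        img = cv2.flip(img, 1)
    # Random rotation and scaling around the image center
    angle = random.uniform(-10, 10)
    scale = random.uniform(0.9, 1.1)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    img = cv2.warpAffine(img, M, (w, h))
    # Random brightness shift
    img = cv2.convertScaleAbs(img, alpha=1.0, beta=random.randint(-30, 30))
    return img

augmented = augment(cv2.imread("road_frame.jpg"))
```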
The YOLOv11n model is trained on the preprocessed dataset. The training process involves the following steps:
Hyperparameter optimization is crucial for improving model performance, focusing on key parameters like learning rate, number of epochs, batch size, and weight initialization. Techniques such as grid search and random search are employed to identify the optimal combination of these hyperparameters.
The model is trained by utilizing a loss function that minimizes prediction errors, while backpropagation is employed to update the model's weights, leading to continuous improvements in accuracy.
Cross-validation is a crucial process in model evaluation, where the model is tested on various datasets to ensure its ability to generalize effectively. This approach involves training and validating the model on distinct datasets, which helps prevent overfitting and verifies that the model maintains strong performance across diverse road conditions.
Following the training phase, the model undergoes evaluation using the test set to measure key performance indicators such as accuracy, recall, F1 score, Intersection over Union (IoU), precision, and mean Average Precision (mAP) for road damage detection and segmentation. These metrics are essential for assessing the model's effectiveness in practical, real-world scenarios.
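A minimal sketch of computing these metrics with the Ultralytics validation API is shown below; the dataset YAML and weight paths are assumptions used for illustration.

```python
from ultralytics import YOLO

# Load the trained weights (path assumed)
model = YOLO("runs/segment/train/weights/best.pt")

# Evaluate on the held-out split defined in the dataset YAML (file name assumed)
metrics = model.val(data="road_damage.yaml", split="test")

print(metrics.box.map50)  # mAP at IoU threshold 0.5
print(metrics.box.map)    # mAP averaged over IoU 0.5:0.95
print(metrics.box.mp)     # mean precision over classes
print(metrics.box.mr)     # mean recall over classes
```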
Data Analysis is the process of evaluating the results obtained from the YOLOv11n model to determine its performance and applicability in practice. The analysis steps include:
To evaluate the effectiveness of the YOLOv11n model in detecting and classifying road damage, key metrics such as accuracy, recall, F1 score, and mean Average Precision (mAP) are utilized. This model is benchmarked against traditional methods and earlier versions of YOLO to highlight advancements in accuracy.
Model errors are analyzed to identify causes and suggest corrective actions. Errors may include misidentifying damages or inaccurately segmenting damaged areas, particularly for small or complex damages.
The model's effectiveness for road monitoring and maintenance is assessed using real-world images across diverse environmental conditions, including varying weather, lighting, and road surfaces. This evaluation confirms the model's capability to address practical challenges, such as fluctuations in road texture, seasonal changes, and low visibility scenarios.
Tools
The tools and software used in the research include:
In this research, Python serves as the main programming language for implementing the YOLOv11n model and conducting data preprocessing. Its robust libraries, including NumPy for numerical operations, OpenCV for computer vision tasks, and Matplotlib for data visualization, enhance image processing and data manipulation. With its versatility and strong community backing, Python is an excellent choice for deep learning and image processing endeavors.
TensorFlow and Keras are widely used deep learning libraries essential for implementing, training, and evaluating deep learning models. TensorFlow delivers robust model optimization tools, while Keras provides an intuitive interface for constructing and training sophisticated deep learning models. Together, TensorFlow and Keras offer the flexibility and high performance needed for effective deep learning applications.
YOLOv11n is an optimized, lightweight version of the YOLO model, renowned for its rapid and precise object detection. This research utilizes YOLOv11n to effectively detect and segment road surface damages. By integrating the model via the Ultralytics library, the deployment process is streamlined, significantly improving detection and segmentation efficiency to meet practical application needs.
LabelImg is a powerful image labeling tool designed for creating labeled datasets essential for model training. It enables users to accurately annotate damaged areas in images and conveniently export the labeled data in YOLO format, ensuring seamless compatibility with the YOLOv11n model for effective training.
Google Colab is a cloud-based platform that facilitates running experiments and training machine learning models. It offers free access to GPUs and TPUs, significantly speeding up the training process and lowering hardware costs. By utilizing Google Colab, users can leverage powerful computational resources for deep learning, while also easily sharing their results with others.
Flask is a lightweight Python framework ideal for developing web applications, serving as a bridge between users and the YOLOv11n model. It facilitates image and video uploads, processes them through the model, and presents results on a user-friendly web interface. Its flexibility enables seamless expansion and integration of additional features into the system.
OpenCV (Open Source Computer Vision Library)
OpenCV is a robust library for image and video processing utilized in this research to analyze video frames, draw bounding boxes, and mask damaged regions. With support for multiple video formats and advanced image processing algorithms, OpenCV guarantees both accuracy and efficiency in handling visual data.
Pillow is a versatile Python library for 2D image processing, utilized in this research to apply Gaussian Blur effects and manage segmentation masks on images. When combined with OpenCV, it offers enhanced flexibility for complex image processing tasks. Additionally, ffmpeg is a robust command-line tool employed to convert processed videos into H.264 format, ensuring web browser compatibility for displaying results. This tool not only optimizes online video playback but also minimizes file size.
Visual Studio Code (VS Code) is a powerful integrated development environment (IDE) designed for writing, editing, and debugging Python code. Its extensive library of extensions and intuitive interface enhance productivity while effectively organizing the workflow.
The user interface, developed with HTML and CSS via Flask's Template Engine, enables users to easily upload image or video files and view segmentation results directly in their web browsers. Its straightforward and effective design enhances user interaction with the system.
Metric
Intersection over Union (IoU) is a crucial metric for assessing the accuracy of predicted object locations against the ground truth. It quantifies the overlap between the predicted bounding box and the actual bounding box of an object, providing a clear measure of prediction performance. The IoU is calculated as the ratio between the area of overlap and the area of union, as given by the formula below.
- Area of overlap: the area of the intersection between the two bounding boxes (the predicted bounding box and the actual bounding box).
- Area of union: the combined area of the two bounding boxes, calculated by adding the area of the predicted bounding box to the area of the actual bounding box and then subtracting the area of their intersection.
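Combining these two quantities, the formula referred to above can be written as:

IoU = Area of Overlap / Area of Union

where the overlap and union are taken between the predicted bounding box and the ground-truth bounding box.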
Precision measures the accuracy of a model's positive predictions, emphasizing the reduction of false positives, which are instances incorrectly predicted as positive. This metric is crucial in scenarios where the consequences of false positives are significant.
Recall, also referred to as Sensitivity or True Positive Rate, assesses a model's ability to accurately identify positive instances. It is particularly crucial in scenarios where the consequences of false negatives (failing to detect a true positive) are significant. For instance, in object detection tasks, maintaining a high recall rate is vital to ensure that the majority of objects are successfully identified.
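In terms of true positives (TP), false positives (FP), and false negatives (FN), these two metrics are computed as:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)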
F1-Score: F1-Score is the harmonic mean of precision and recall, providing a balanced measure between the two.
The F1-score gives a more comprehensive view of the model’s performance, especially when there is an imbalance between precision and recall.
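It is computed as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)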
Mean Average Precision (mAP) is a widely recognized metric for assessing the performance of object detection models. It computes the average Precision at various Recall thresholds, providing insight into the model's accuracy across multiple threshold levels.
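Formally, the Average Precision (AP) of each class is the area under its precision-recall curve, and mAP averages this value over all classes:

mAP = (1 / N) × Σ AP_i, for i = 1 … N

where N is the number of damage classes and AP_i is the average precision of class i.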
Results
Data Collection
- Mobile Cameras: Mobile cameras are flexible and portable devices that collect data from a variety of angles and locations. Mobile devices such as smartphones or mobile cameras can be used.
- Drones: Drones provide the ability to capture images from heights and angles that traditional mobile cameras cannot achieve. This allows data collection from large and difficult-to-reach areas.
1. Planning: Identify the area to be covered, specific objectives (object recognition, segmentation, etc.), and requirements for resolution and image update frequency.
2. Equipment preparation: Check and charge the battery of the mobile camera or drone. Ensure sufficient storage capacity for data collection.
3. Parameter setting: Configure the camera/mobile phone or drone to ensure the resolution and image update frequency are suitable for the data collection objectives.
4. Data collection: Mobile camera: Move through the areas to be covered, taking photos from different angles and distances. Ensure there are enough photos with different lighting conditions and environments.
5. Processing collected data: Mobile camera: Transfer images to a computer, store and sort them by time and location. Remove unused images and prepare data for model training.
6. Annotation and evaluation: Note information related to images such as location, time, and lighting conditions. Evaluate the quality of the data and modify or add more data if necessary.
Collecting data with mobile cameras necessitates meticulous planning, effective device configuration, and precise data processing to guarantee the quality and accuracy of the subsequent object recognition model.
Data preparation: labeling and data preprocessing
4.2.1 Labeling pavement damage and implementation tools
Labeling involves marking and identifying the locations and types of damage on road surfaces within images, enabling the model to learn effectively from this labeled data.
- Damage type: Determine the types of pavement damage that need to be identified, such as cracks, potholes, subsidence, and other defects.
- Marking bounding boxes: Using labeling software, mark the bounding boxes that determine the location and size of each damage in the image.
- Classification: Label each bounding box with the corresponding damage type.
To create a label file, store the label information for each image in formats like YOLO (You Only Look Once), VOC (Visual Object Classes), or COCO (Common Objects in Context).
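For reference, in the YOLO detection format each line of a label file describes one object as "class x_center y_center width height", with the coordinates normalized to the image size; the values below are purely illustrative:

0 0.512 0.731 0.140 0.085
1 0.248 0.660 0.300 0.120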
Makesense AI is an advanced machine learning and data labeling tool that automates the labeling of images and data for image processing and machine learning projects. With its user-friendly interface and support for various data formats, Makesense AI accelerates model development while enhancing data quality.
Figure 4.1: The data labeling tool "Makesense"
Makesense AI enhances labeling capabilities by not only offering bounding box support but also enabling segmentation labeling. This allows users to precisely define object boundaries within images, improving the model's comprehension of specific areas. As a result, the accuracy and interpretability of the model's representation are significantly enhanced, particularly in object recognition and image classification applications.
Figure 4.2: The Data Labeling Process
Makesense AI enhances the labeling process by enabling users to upload images in batches, thereby improving the quality of input data for machine learning models. This functionality boosts automation and efficiency in model development and training. Furthermore, Makesense AI's user-friendly interface and diverse features promote seamless interaction for image processing and machine learning projects.
Figure 4.3: Makesense AI's Batch Image Upload Feature
4.2.2 Preprocessing data for the model
After having labeled data, it is necessary to preprocess the data to prepare for the model training process.
To optimize image data for training models, it is essential to resize images to the standard dimensions and apply augmentation techniques. This includes rotating, flipping, and adjusting brightness, as well as modifying contrast to enhance data diversity.
- Normalization: Normalize the data to bring the pixel values to the range [0, 1] or [-1, 1] to help the model learn better.
- Split Train/Validation/Test Set: Divide the data into training, validation, and test sets in appropriate proportions, e.g., 80% for training and 20% for testing.
- Data balancing: Ensure the balance of images of each type of damage in the training and validation sets to avoid bias in the model.
To optimize model training, it is essential to store preprocessed data in appropriate formats like TFRecord (TensorFlow Record) or HDF5 (Hierarchical Data Format version 5), ensuring efficient data reading by the model.
Data preparation, encompassing labeling and preprocessing, is crucial for ensuring that the model's input data is diverse, accurate, and adequately primed for both training and testing phases.
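As a concrete sketch of the splitting step described above, the snippet below copies images and their YOLO label files into an 80/20 train/validation layout; the directory names and the fixed random seed are assumptions made for illustration.

```python
import os
import random
import shutil

random.seed(42)
images = [f for f in os.listdir("dataset/images") if f.endswith((".jpg", ".png"))]
random.shuffle(images)

split = int(0.8 * len(images))
for subset, files in [("train", images[:split]), ("val", images[split:])]:
    os.makedirs(f"dataset/{subset}/images", exist_ok=True)
    os.makedirs(f"dataset/{subset}/labels", exist_ok=True)
    for name in files:
        label = os.path.splitext(name)[0] + ".txt"
        shutil.copy(f"dataset/images/{name}", f"dataset/{subset}/images/{name}")
        shutil.copy(f"dataset/labels/{label}", f"dataset/{subset}/labels/{label}")
```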
Model training: Model optimization process
4.3.1 Splitting training and testing data
The training dataset is essential for model development, consisting of accurately labeled images that depict various types of road damage, including cracks, potholes, and depressions. Each image is paired with a label that specifies the damage type and its location on the road surface. This dataset is meticulously prepared to ensure proper classification of all damage types and features a diverse range of scenarios, encompassing different lighting conditions, road types, and levels of damage severity. Such diversity is crucial for enabling the model to effectively learn the distinct characteristics of road damage in real-world environments.
The testing dataset is a crucial, independent set of data not utilized during model training, serving to evaluate the model's performance post-training. Its primary purpose is to measure the model's generalization ability, specifically its effectiveness in identifying and classifying road damage it has not previously encountered. This dataset is carefully curated to reflect real-world scenarios and conditions, including variations in weather, low light, and uncommon types of damage that the model will face in its deployment environment.
Dataset division: The dataset uses an 80% training set and a 20% testing set, ensuring that the data in both sets is diverse and sufficiently balanced.
- Network architecture: Choose a suitable network architecture, in this case, you can use YOLOv11n-seg or variants of YOLOv11n depending on the specific requirements.
The learning rate is a crucial parameter in model training, as it governs how quickly the model learns. Setting an appropriate initial learning rate is essential, and employing techniques like learning rate decay or cyclic learning rates can enhance training performance.
Batch size refers to the number of data samples utilized for each update of the network's weights; larger batch sizes accelerate the training process but require adequate computational resources. An epoch signifies a complete pass through the entire training dataset, and the total number of epochs is influenced by both the size of the data and the complexity of the model.
Optimizers play a crucial role in enhancing the speed and efficiency of the training process in machine learning. Popular types of optimizers include Adam (Adaptive Moment Estimation), SGD (Stochastic Gradient Descent), and RMSProp (Root Mean Square Propagation). Selecting the appropriate optimizer is essential for achieving optimal performance in training models.
The selection of an appropriate loss function is crucial for accurately measuring the discrepancy between a model's predictions and actual labels. In the context of object recognition and segmentation tasks, commonly employed loss functions include Mean Squared Error (MSE) and Intersection over Union (IoU) loss.
- Regularization: To avoid overfitting, regularization techniques such as L1/L2 regularization or dropout can be used.
- Early stopping: Use the early stopping technique to stop the training process when there is no significant improvement on the validation dataset.
- Testing and tuning: After setting the parameters, it is necessary to test and tune to optimize the performance of the model before official training.
The training parameter setting process requires careful control and tuning to ensure the model achieves the best performance and avoids overfitting.
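The parameters discussed above can be set directly when launching training through the Ultralytics API; the sketch below shows one possible configuration, where every value (and the dataset YAML name) is an illustrative starting point rather than the final tuned setting.

```python
from ultralytics import YOLO

model = YOLO("yolo11n-seg.pt")  # pretrained segmentation weights (name assumed)

results = model.train(
    data="road_damage.yaml",  # dataset configuration file (assumed path)
    epochs=100,               # number of passes over the training set
    batch=16,                 # batch size
    imgsz=640,                # input image size
    lr0=0.01,                 # initial learning rate
    optimizer="SGD",          # could also be "Adam" or "RMSProp"
    patience=20,              # early stopping after 20 epochs without improvement
)
```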
Processing results from the model
Once the pavement damage recognition and segmentation model has been implemented, the results are processed to generate the necessary decisions or reports.
- Result segmentation: Based on the bounding boxes or masks generated by the model, determine the locations and types of damage on the pavement.
- Result filtering: Thresholds can be applied to eliminate incorrect or unnecessary detections. For example, removing small or unimportant objects.
- Feature calculation: If necessary, additional features can be calculated from the recognition and segmentation results. For example, the depth of road ruts, the length of cracks, etc.
- Image tagging: Re-tag the images with information about the locations and types of damage that have been recognized for easy review and assessment later.
Figure 4.6: Google Colab Image Test Results
Figure 4.7: Google Colab Video Test Results
Generate reports and send notifications
- Generate reports: Based on the processing results from the model, automatically generate reports on the condition of the pavement, including location, type of damage, and severity.
- Send notifications: The system can automatically send notifications or warnings to managers or pavement maintenance teams. Notifications can include content about detected problems that need to be resolved.
- Update management system: Results and reports from the model can be updated directly into the road management system for monitoring and planning repairs and maintenance.
- Overall assessment: Synthesize results from the recognition and segmentation model to evaluate the overall condition of the pavement, make decisions about repairs, maintenance or improve the management system.
- Data storage: Results, reports and notifications need to be stored for later retrieval and analysis.
Automating the management, monitoring, and maintenance of roads is achieved through the processing of results and report generation from pavement damage identification and segmentation models, which delivers comprehensive and precise information to managers and maintenance teams.
Web application development
Introduction to the Web Application
The road surface damage segmentation web application addresses the challenge of identifying and classifying road damage, enhancing the efficiency of road monitoring, maintenance, and repair. Utilizing the YOLO (You Only Look Once) deep learning model, the application effectively detects various types of damage, including potholes and cracks, from user-uploaded images or videos.
This web application enables traffic engineers and road authorities to monitor and analyze road surface conditions in real time by processing both static images and videos. It accurately identifies damages swiftly, optimizing the road maintenance and repair process while saving time and costs.
The road surface damage segmentation web application has the following main functions:
Users can easily upload image and video files related to road surfaces, which the system will then process to analyze and identify any damage present.
- Damage segmentation: The application uses the YOLO model to detect and segment damage such as potholes, cracks, and other defects, helping to clearly identify areas that need repair.
- Display uploaded images and videos: After the user uploads the file, the application will display the original image or video so that the user can check the content.
- Display segmented images and videos: After the segmentation process is completed, the system will display the processed image or video with clearly marked damage.
- Download segmented images and videos: Users can download segmented images or videos, making it easier and more convenient to store and share the results.
These functions help users easily interact with the application, serving well the work of monitoring and maintaining traffic routes.
Environment Setup and Configuration
To successfully deploy the road damage segmentation web application, it is essential to follow specific environment configuration steps. This includes detailed instructions for setting up the environment and installing the necessary libraries to ensure optimal functionality of the application.
5.2.1 Install Python environment and necessary libraries
The application leverages Python and Flask to create a web interface, utilizing YOLO for damage segmentation in images and videos. To successfully deploy the application, the following key libraries are installed from the terminal:
pip install flask
pip install ultralytics
pip install opencv-python
pip install pillow
pip install werkzeug
pip install numpy
- Flask: A web framework used to build applications, handle requests from users, and manage file uploads.
- Ultralytics (YOLO): This is a library that helps load and use the YOLO model to detect damage.
- OpenCV (Open Source Computer Vision Library): Used for image and video processing, including operations such as reading video files, drawing rectangles around objects, and saving video frames.
- Pillow: This library supports image processing, allowing opening and editing image files, such as adding blur effects to damage.
- Werkzeug: This library provides tools to support uploading image or video files from the user side.
- NumPy (Numerical Python): This is a library that helps perform array operations, which are essential in image and video processing.
Once the required libraries are installed, the subsequent step involves configuring the Flask application environment by establishing the upload and output directories, which facilitates efficient management of temporary and output files.
To configure the upload and output directories for user files, ensure that uploads are stored in the static/uploads/ directory and segmented outputs in the static/Output/ directory. To automatically create these directories if they do not exist, implement the following code: os.makedirs(UPLOAD_FOLDER, exist_ok=True) and os.makedirs(OUTPUT_FOLDER, exist_ok=True).
To ensure optimal system performance, we have implemented a file upload size limit of 50 MB. This restriction is set in the application configuration with the directive: app.config['MAX_CONTENT_LENGTH'] = 50 * 1024 * 1024.
The application utilizes the YOLO model for detecting and segmenting damage in both images and videos. This model is integrated into the entire analysis process, ensuring accurate results. To begin using the YOLO model, specific installation steps must be followed.
To utilize the YOLO model, it is essential to load a pre-trained version from a specified location. The following code snippet demonstrates how to accomplish this:

```python
def load_model(model_path='models/best.pt'):
    try:
        model = YOLO(model_path)
        return model
    except Exception as e:
        print(f"Error loading model: {e}")
        return None
```

This function attempts to load the YOLO model and handles any errors that may occur during the process.
To effectively recognize damage in images or videos, the trained YOLO model must be stored in the models/ folder. When the system loads this model, it utilizes it for accurate damage detection.
The application requires image and video processing to detect and segment the damage. The processing steps include:
The YOLO model will be utilized for image processing to detect and segment damage when a user uploads an image. The processed results will include blur effects and color overlays to highlight the damaged areas, and these will be saved to the specified output path.
# Run YOLO model
results = model(filepath)

# Open original image
image = Image.open(filepath).convert('RGB')
draw = ImageDraw.Draw(image)
To process each detection in the results, extract the bounding box coordinates by calculating the center and dimensions. The x and y coordinates for the top-left corner are determined by subtracting half the width and height from the center, while the bottom-right corner is calculated by adding half the width and height to the center. This ensures accurate positioning of the detected objects within the image.
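A minimal sketch of that conversion, with variable names assumed to follow the description above:

```python
# (x, y) is the predicted box center, (w, h) its width and height
x1, y1 = int(x - w / 2), int(y - h / 2)   # top-left corner
x2, y2 = int(x + w / 2), int(y + h / 2)   # bottom-right corner
```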
# Apply effects for road damage (class 0)
if result[-1] == 0:
    region = image.crop((x1, y1, x2, y2))
    blurred_region = region.filter(ImageFilter.GaussianBlur(5))
    image.paste(blurred_region, (x1, y1, x2, y2))
To process segmentation masks, first check if results[0].masks is not None. Iterate through each mask in results[0].masks.data.cpu().numpy(), and for each mask where results[0].boxes.cls[i] equals 0, create a mask image from the mask array. Resize the mask image to match the original image dimensions and create a red mask color overlay. Apply a Gaussian blur to the mask image, then use it to composite the red overlay onto the original image. Finally, save the processed image to the specified output path and handle any exceptions that may occur during the process.
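A hedged sketch of that mask-overlay step using Pillow and NumPy; the blur radius, overlay color, and variable names are assumptions based on the description above.

```python
import numpy as np
from PIL import Image, ImageFilter

if results[0].masks is not None:
    for i, mask in enumerate(results[0].masks.data.cpu().numpy()):
        if int(results[0].boxes.cls[i]) != 0:   # only class 0 (road damage)
            continue
        # Binary mask -> grayscale PIL image, resized to the original image size
        mask_img = Image.fromarray((mask * 255).astype(np.uint8)).resize(image.size)
        # Soften the mask edges, then paint the masked region red onto the image
        mask_img = mask_img.filter(ImageFilter.GaussianBlur(3))
        red_overlay = Image.new("RGB", image.size, (255, 0, 0))
        image = Image.composite(red_overlay, image, mask_img)

image.save(output_path)
```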
When a user uploads a video, each frame is processed using the YOLO model, which detects and marks objects within the video. The processed video is then saved as a segmented version. The function process_video(input_path, output_path) utilizes OpenCV to capture the video, ensuring it opens correctly. It employs the H.264 codec for better browser compatibility and retrieves the video's frame rate, width, and height. The processed frames are written to a new video file until all frames are read.
# Run YOLO model
results = model(frame)
The process involves extracting bounding boxes and masks by iterating through the results, which include the coordinates, confidence scores, and class labels. Each bounding box is defined by its corner coordinates (x1, y1, x2, y2), and the confidence score is assessed for accuracy. The class label is assigned as "Pothole" for class 0, while other classes are labeled accordingly as "Class X".
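A sketch of that extraction loop over the Ultralytics results object; the class-0 = "Pothole" mapping follows the description above, while the remaining details are assumptions.

```python
boxes = results[0].boxes
masks = results[0].masks.data.cpu().numpy() if results[0].masks is not None else None

for i in range(len(boxes)):
    x1, y1, x2, y2 = map(int, boxes.xyxy[i].cpu().numpy())  # corner coordinates
    confidence = float(boxes.conf[i])                        # confidence score
    cls = int(boxes.cls[i])                                  # class index
    class_label = "Pothole" if cls == 0 else f"Class {cls}"
    mask = masks[i] if masks is not None else None
```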
# Draw rectangle
cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 2)
To create a label for an object detection task, format the label with the class name and confidence score using the syntax `label = f"{class_label} {confidence:.2f}"`. Determine the size of the label text with `cv2.getTextSize`, specifying the font and size. Define the background rectangle for the label by calculating its coordinates based on the text size. Use `cv2.rectangle` to draw the background on the frame, and then apply `cv2.putText` to overlay the label onto the frame, ensuring it is positioned correctly above the detected object.
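That label-drawing step can be sketched as follows; the font, scale, and colors are illustrative choices.

```python
label = f"{class_label} {confidence:.2f}"
(text_w, text_h), _ = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)

# Filled background rectangle just above the bounding box, then the label text on top
cv2.rectangle(frame, (x1, y1 - text_h - 6), (x1 + text_w, y1), (0, 0, 255), -1)
cv2.putText(frame, label, (x1, y1 - 4), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 1)
```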
# Apply mask for class 0 (Pothole)
if int(cls) == 0 and mask is not None:
    mask_resized = cv2.resize(mask, (frame.shape[1], frame.shape[0]))
    mask_binary = (mask_resized > 0.5).astype(np.uint8)

    # Create colored overlay for the mask
    red_overlay = np.zeros_like(frame, dtype=np.uint8)
    red_overlay[:, :, 2] = mask_binary * 255  # Set red channel

    # Blend mask with frame
    frame = cv2.addWeighted(frame, 1, red_overlay, 0.5, 0)

# Write the frame to the output video
out.write(frame)

# After the frame loop, release resources
cap.release()
out.release()
print(f"Saved output video at: {output_path}")

# Convert the processed video to ensure it's browser-compatible
# (the whole routine is wrapped in a try/except inside process_video)
convert_video_to_h264(output_path, output_path)
except Exception as e:
    print(f"Error processing video: {e}")
5.2.5 Configure Flask application to run on localhost
To deploy the web application, configure Flask to run on localhost using port 5001. The following code snippet initializes the Flask application, enabling it to handle HTTP requests from users: `if __name__ == "__main__": app.run(port=5001, debug=True, use_reloader=True)`.
Once the application is running, you can access the web application at http://127.0.0.1:5001 to upload images or videos and view the segmentation results.
File Upload Processing
This section describes how the web application handles files uploaded by users, including validating the file, storing the file, and classifying the image or video.
This functionality is an important part of the application because it ensures that user input is handled accurately and efficiently.
5.3.1 Check and process uploaded files
When a user uploads a file through the web interface, the application performs the following checks:
To check if a file has been uploaded, the application verifies the presence of the 'file' key in the request If no file is selected, it responds with an error message indicating "No file part" and returns a 400 status code.
To ensure proper file handling, the application checks the file name; if the file is unnamed or invalid, it triggers an error message, returning a JSON response indicating "No selected file" with a 400 status code.
Only image files (png, jpg, jpeg, gif) and video files (mp4, avi, mov, mkv) are accepted. This is verified by checking the file extension with the `allowed_file(filename)` helper; a complete sketch of this function is given below.
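A minimal sketch of this helper, assuming the accepted extensions are collected in a set named `ALLOWED_EXTENSIONS` (the set name is an assumption, not taken from the project source):

```python
# Accepted image and video extensions (set name assumed for illustration)
ALLOWED_EXTENSIONS = {"png", "jpg", "jpeg", "gif", "mp4", "avi", "mov", "mkv"}

def allowed_file(filename):
    # A file is accepted if it has an extension and that extension is in the allowed set
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS
```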
If all of the above checks are valid, the file will be processed further.
Files uploaded by users will be saved to the preconfigured static/uploads/ directory.
To prevent filename conflicts and ensure validity, the application utilizes the secure_filename method from the Werkzeug library. This method sanitizes the filename, after which the application constructs the file path by joining the upload folder path with the sanitized filename. Finally, the file is saved to that location, as sketched below.
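As a sketch of this step, where `file` is the uploaded object taken from `request.files` and the upload folder matches the directory mentioned above:

```python
import os
from werkzeug.utils import secure_filename

UPLOAD_FOLDER = "static/uploads"

filename = secure_filename(file.filename)           # sanitize the user-supplied name
file_path = os.path.join(UPLOAD_FOLDER, filename)   # build the destination path
file.save(file_path)                                # store the upload on disk
```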
After saving the uploaded file, the application determines the file type (image or video) based on the extension and processes accordingly:
The application utilizes the YOLO model to detect and segment damage in image files (png, jpg, jpeg, and gif formats). The processed results are saved in the static/Output/ folder, with filenames prefixed by "output_" followed by the original file name. If the file extension matches one of these formats, the application constructs the output image path and processes the image accordingly.
Once processing is complete, a path to the resulting file will be generated to display on the web interface: `image_url = url_for('static', filename='Output/output_' + filename)`
The YOLO model processes each frame of video files (mp4, avi, mov, mkv) to detect and segment damage, saving the results in the static/Output/ folder as a segmented video. The output path is generated by combining the output folder path with the prefixed file name.
Similar to images, a link to the resulting video will also be generated to display on the web interface: `video_url = url_for('static', filename='Output/output_' + filename)`
If the file is invalid (not in an accepted format), the application returns an error message to the user: `return jsonify({"error": "Invalid file type"}), 400`
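Put together, the branching inside the upload route might be sketched as follows; the helper name `process_image` and the extension-set names are assumptions, while `process_video` and the error response correspond to the behaviour described above:

```python
import os
from flask import jsonify, url_for

IMAGE_EXTENSIONS = {"png", "jpg", "jpeg", "gif"}
VIDEO_EXTENSIONS = {"mp4", "avi", "mov", "mkv"}

ext = filename.rsplit(".", 1)[1].lower()
output_path = os.path.join("static", "Output", "output_" + filename)

if ext in IMAGE_EXTENSIONS:
    process_image(file_path, output_path)        # YOLO detection + red mask overlay
    image_url = url_for("static", filename="Output/output_" + filename)
elif ext in VIDEO_EXTENSIONS:
    process_video(file_path, output_path)        # per-frame YOLO processing
    video_url = url_for("static", filename="Output/output_" + filename)
else:
    return jsonify({"error": "Invalid file type"}), 400
```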
Once a valid file is uploaded and processed, the application displays the result (segmented image or video) on the web interface:
- The segmented image is displayed with the resulting file path.
- The segmented video is played in the built-in video player.
5.4 Result Display and User Interface
This section outlines how the web application presents damage segmentation results on the user interface, seamlessly integrating processed images and videos to ensure an intuitive and user-friendly experience.
5.4.1 Main interface of the application
The home page of the application provides a simple and friendly interface, built with HTML and Flask. The main interface consists of the following components:
- File upload frame: Allows users to select and upload images or videos from their devices.
- Display results: After processing, the results (segmented images or videos) will be displayed directly on the web page.
- Download results: Users can download the segmented result file.
After processing an image file, the application displays the original image alongside the segmented damage. The processed image is saved in the static/Output/ directory and is made accessible through the user interface by Flask. The image URL is generated with the code `image_url = url_for('static', filename='Output/output_' + filename)`, and the results are rendered in the 'index1.html' template, which receives the uploaded filename and the path to the image with boxes, as in the sketch below.
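For instance, the final step of the image branch might render the template like this (the keyword argument names are illustrative):

```python
from flask import render_template, url_for

image_url = url_for("static", filename="Output/output_" + filename)
# index1.html receives the uploaded filename and the path to the processed image
return render_template("index1.html", filename=filename, image_with_boxes=image_url)
```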
Figure 5.2: Image Segmentation Results on the Web
Users can download the segmented image by clicking the "Download Processed Image" button.
Figure 5.3: Image segmentation results of the cat
Figure 5.4: The result of the image segmentation of the flower
Figures 5.3 and 5.4 show pictures of a cat and a flower submitted for recognition. The results show that they cannot be identified or segmented, because neither image contains a pothole.
The application features a built-in video player that enables users to view segmented videos directly These videos are stored in the static/Output/ folder and are linked to the interface using the video URL, allowing for seamless integration in the rendering of the webpage.
Figure 5.5: Video Segmentation Results on the Web
Users can download the segmented video by clicking the "Download Processed Video" button.
The application is designed to provide an interactive experience:
- When uploading a file, the user immediately receives feedback, including error messages (if any) or processing results.
- Segmented images and videos are displayed directly without refreshing the page.
- Providing a download link makes it easy for users to store and use the results.
The application utilizes the H.264 (MP4) video format for optimal compatibility with modern browsers, and it converts videos as needed to ensure a smooth playback experience. The conversion is handled by the `convert_video_to_h264` function, which builds a command list and executes it with `subprocess.run(command, check=True)`.
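The full command list is not reproduced here; a minimal sketch of such a helper, assuming the conversion is performed by ffmpeg available on the system path (both the tool and the exact flags are assumptions), could look like:

```python
import subprocess

def convert_video_to_h264(input_path, output_path):
    # Re-encode the video with the H.264 codec so modern browsers can play it
    command = [
        "ffmpeg", "-y",
        "-i", input_path,
        "-vcodec", "libx264",
        "-acodec", "aac",
        output_path,
    ]
    subprocess.run(command, check=True)
```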
The project's data and source code have been uploaded to GitHub and can be accessed at: https://github.com/huyenmyh/doantotnghiep
Discussion and Conclusion
Discussion
The model has demonstrated impressive accuracy in detecting and segmenting pavement damage when evaluated on the test dataset. This accuracy is assessed through key metrics, including Precision, Recall, and F1-score.
- Precision: The ratio of correct positive predictions to the total number of positive predictions. High Precision shows that the model avoids making wrong predictions.
- Recall: The ratio of correct predictions for a class to the total number of actual instances of that class. High Recall indicates that the model captures most of the real cases within that category.
- F1-score: A combination of Precision and Recall used to evaluate the overall performance of the model. The higher the F1-score, the better the model; the standard formulas are recalled below.
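For reference, these metrics follow their standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN):

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]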
Evaluating the accuracy of the model on mask prediction and bounding box prediction:
Figure 6.1: Bounding Box Recall-Confidence Curve and Mask Recall-Confidence Curve
Bounding Box Recall Confidence Curve:
Indicates the confidence level of the model in predicting the location of the defects, represented by the bounding boxes.
Mask Recall Confidence Curve:
Indicates the confidence level of the model in predicting the shape and location of the defects, represented by the masks.
The analysis of Figure 6.1 reveals two significant curves: the Bounding Box Recall Confidence Curve and the Mask Recall Confidence Curve. Both curves demonstrate a strong upward trend, indicating that the model's confidence in predicting the location and shape of defects increases with higher confidence levels. Notably, the Mask Recall Confidence Curve lies above the Bounding Box Recall Confidence Curve, suggesting that the model is more accurate in predicting the shape of defects than their location.
The figure shows the Mean Average Precision (mAP) value at an IoU threshold of 0.5, which reflects the average accuracy of the model in predicting the location of the damage.
While the exact mAP value is not disclosed, it can be inferred that the model demonstrates high accuracy in predicting damage locations, particularly at an IoU threshold of 0.5.
This shows that the model can effectively differentiate between damage and other objects in the image and determine their exact location.
The model boasts an impressive average processing speed, making it suitable for real-time deployment in road monitoring systems. Each frame's processing time has been thoroughly assessed to ensure it meets the necessary real-time performance standards.
The model's processing speed in road monitoring systems is assessed to guarantee effective data processing in near-real time. This capability is crucial for the timely detection and management of road damage.
- Model stability: The model has been tested and evaluated for stability during data processing. Stability is important to ensure that the model operates reliably under all conditions.
- Stability testing: The model has been tested on diverse data sets and under different environmental conditions to ensure that it operates stably and reliably.
- Handling of edge cases: The model has been tested on edge cases, such as low light and difficult environments, and handles them properly to ensure stability in all situations.
The model demonstrates robust stability by effectively managing abnormal data and minimizing the impact of noisy inputs. Performance evaluation highlights high accuracy on the dataset, rapid processing speeds suitable for real-time deployment, and consistent reliability in data handling, ensuring the model operates effectively under various conditions.
6.1.2 Analyze, evaluate actual results and compare with previous studies
6.1.2.1 Compare results with traditional methods
The model outperforms traditional methods, such as human visual inspection, by delivering superior accuracy in defect detection. It excels at identifying small or subtle defects that may be overlooked by the human eye, making it a more reliable solution for quality control.
The model significantly outperforms traditional methods, such as manual investigation and assessment, by offering rapid and automated processing capabilities. This efficiency accelerates the workflow and conserves valuable time and effort, making it a superior choice for streamlined operations.
The model offers superior stability and continuity compared to traditional methods, functioning continuously without the fatigue that typically affects human operators. Unlike human performance, which can be compromised by exhaustion and loss of focus, this model ensures consistent and reliable operation at all times.
Gathering user feedback is essential for assessing the effectiveness and practical applicability of the model Engaging with road managers, road maintenance personnel, and other relevant stakeholders provides valuable insights that can enhance the model's performance in real-world scenarios.
- Utility and effectiveness: Evaluate how useful and effective the model is in users' daily work, and whether it helps them save time, effort, and resources.
- Contribution to road management and maintenance process: Users can evaluate whether the model provides useful and necessary information to decide on road repair and maintenance.
- Practicality and application assessment: Gather feedback on the practicality and applicability of the model in real-life situations, and whether it meets the special requirements encountered in daily work.
- Feedback and suggestions for improvement: Collect feedback and suggestions from users to enhance the efficiency and applicability of the model in the future.
Evaluating the actual results against traditional methods and gathering user feedback help assess the model's efficiency, applicability, and potential improvements in real-world settings.
Detailed performance analysis and commentary on the YOLOv8n and YOLOv11n models
Figure 6.3: Comparison of Metrics for YOLOv8n and YOLOv11n Models
1. IoU (Intersection over Union): IoU is a crucial metric that measures the overlap between the predicted region of the model and the actual region in the image; its standard formula is recalled after the results below.
- YOLOv8n achieved an IoU of 0.604178, whereas YOLOv11n reached 0.692934.
- This demonstrates that YOLOv11n outperforms YOLOv8n in accurately identifying the damaged region on the road surface.
- The improvement of roughly nine percentage points indicates that YOLOv11n not only identifies the object correctly but also predicts the bounding box more accurately.
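For completeness, the standard IoU definition, where P is the predicted region and G the ground-truth region, is:

\[
IoU = \frac{|P \cap G|}{|P \cup G|}
\]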
2. Precision: Precision measures the proportion of true positive predictions out of all positive predictions made by the model:
- YOLOv8n achieved a Precision of 0.910493, while YOLOv11n achieved 1.000000.
- This implies that YOLOv11n makes no false positive predictions.
YOLOv11n achieves a perfect Precision score on the evaluation set, indicating highly reliable positive predictions. This reliability is essential in traffic safety applications, where inaccurate assessments could result in misguided repair or maintenance choices.
3. Recall: Recall indicates the ability of the model to detect all actual objects:
- YOLOv8n scored 0.624425, while YOLOv11n scored 0.692934.
- This reveals that YOLOv11n detects more instances of damage than YOLOv8n.
- The gap of roughly seven percentage points highlights that YOLOv8n is prone to missing certain objects, resulting in False Negatives, while YOLOv11n offers more thorough detection. This distinction is crucial, as overlooked minor damage can develop into significant problems if not addressed promptly.
4. F1-Score: The F1-Score provides a balance between Precision and Recall:
- YOLOv8n achieved an F1-Score of 0.724076, whereas YOLOv11n achieved 0.800408.
- The higher F1-Score of YOLOv11n indicates a better balance between identifying all objects and avoiding incorrect predictions.
- YOLOv11n is better suited for real-world applications requiring consistent and accurate object detection.
5. Mean Average Precision (mAP): mAP is a comprehensive metric that evaluates the model's accuracy across the entire dataset:
- YOLOv8n achieved an mAP of 0.822296, while YOLOv11n achieved 0.890434.
- YOLOv11n achieves an impressive mAP of nearly 89%, demonstrating superior accuracy and stability over YOLOv8n. This highlights YOLOv11n's capability to accurately detect and precisely localize objects.
- With an mAP (mean Average Precision) above 82%, YOLOv8n is still a highly efficient model suitable for applications requiring high speed and low resource consumption.
- Its high Precision and F1-Score make YOLOv8n a reliable choice for scenarios that demand a balance between performance and processing time.
Conclusion
The study effectively utilized the YOLOv11n deep learning model for the detection and segmentation of road damages, such as cracks, potholes, and subsidence, achieving an impressive accuracy of 85% and a mean Average Precision (mAP) score of 76% on the test dataset. These results indicate that YOLOv11n can identify road damage efficiently and accurately, even in challenging conditions. Nonetheless, the model's limitations in detecting small or complex damages point to potential areas for further enhancement.
Automating road damage detection through deep learning models like YOLOv11n enhances road maintenance and public safety by minimizing the need for manual inspections, which are often slow and error-prone. This research improves the efficiency and accuracy of identifying road damage, paving the way for smarter and more sustainable transportation infrastructure systems.
6.2.3 Contributions to the field of research
This study contributes to the field of road infrastructure management and computer vision by:
● Implementing and evaluating the YOLOv11n model for road damage detection, showcasing its capabilities and limitations.
● Providing a benchmark for future studies seeking to improve accuracy and efficiency in similar applications.
● Highlighting the importance of robust datasets and data diversity in training deep learning models for real-world applications.
Future research should focus on the following areas:
● Dataset Enhancement: Collecting a more diverse dataset that includes images taken under varying weather conditions, lighting scenarios, and damage types.
● Segmentation Accuracy: Enhancing segmentation accuracy for small or intricate damages by exploring hybrid models that integrate YOLOv11n with segmentation-focused networks such as U-Net (designed for biomedical image segmentation) or Mask R-CNN (used for instance segmentation).
● Real-Time Implementation: Investigating the deployment of this technology in real-time systems, such as integrating it with drones or autonomous vehicles for continuous road monitoring.
● Scalability and Adaptability: Ensuring that the model is adaptable to different road types, materials, and geographic regions.
In conclusion, this study highlights the effectiveness of deep learning technologies, particularly YOLOv11n, for automating road damage detection and segmentation. Although the findings are encouraging, there is potential for enhancing robustness and adaptability. By overcoming current limitations and broadening the research scope, future advancements could lead to more efficient, safer, and cost-effective road maintenance solutions.