(Đồ án HCMUTE) Design of Advanced Driver Assistance System Based on Deep Learning

DOCUMENT INFORMATION

Basic information

Title: Design of Advanced Driver Assistance System Based on Deep Learning
Author: Thai Hoang Minh Tam
Supervisor: Le Minh Thanh, M.Eng.
University: Ho Chi Minh City University of Technology and Education
Major: Computer Engineering Technology
Document type: Graduation project (Đồ án)
Year of publication: 2022
City: Ho Chi Minh City
Pages: 83
File size: 10.08 MB


Contents

  • CHAPTER 1: INTRODUCTION
    • 1.1 OVERVIEW
    • 1.2 GOALS
    • 1.3 LIMITATIONS
    • 1.4 OUTLINES
  • CHAPTER 2: LITERATURE REVIEW
    • 2.1 DEEP LEARNING
      • 2.1.1 Convolutional Neural Network
      • 2.1.2 Convolutional Layer
      • 2.1.3 Pooling Layer
      • 2.1.4 Fully Connected Layer
    • 2.2 OBJECT DETECTION
      • 2.2.1 Two-Stage Object Detection
      • 2.2.2 One-Stage Object Detection
    • 2.3 YOLOv6 OBJECT DETECTION ARCHITECTURE
      • 2.3.1 RepVGG Backbone
      • 2.3.2 RepPAN Neck
      • 2.3.3 Decoupled Head
  • CHAPTER 3: SYSTEM DESIGN
    • 3.1 OVERALL SYSTEM
    • 3.2 COMPARISON OF OBJECT DETECTION MODELS
    • 3.3 TRAFFIC SIGN RECOGNITION
      • 3.3.1 TSR Overview
      • 3.3.2 Training Process
      • 3.3.3 TSR Algorithm
    • 3.4 FORWARD COLLISION WARNING
      • 3.4.1 FCW Overview
      • 3.4.2 Training Process
      • 3.4.3 FCW Algorithm
    • 3.5 LANE DEPARTURE WARNING
      • 3.5.1 LDW Overview
      • 3.5.2 LDW Algorithm
    • 3.6 GRAPHIC USER INTERFACE
    • 3.7 ELECTRICAL/ELECTRONIC ARCHITECTURE
      • 3.7.1 Overview
      • 3.7.2 Hardware Utilization
  • CHAPTER 4: RESULTS
    • 4.1 SIMULATION RESULTS
      • 4.1.1 Traffic Sign Recognition Results
      • 4.1.2 Forward Collision Warning Results
      • 4.1.3 Lane Departure Warning Results
    • 4.2 EXPERIMENTAL RESULTS
  • CHAPTER 5: CONCLUSION AND FUTURE WORK
    • 5.1 CONCLUSION
    • 5.2 FUTURE WORK

Content

INTRODUCTION

OVERVIEW

Advanced Driver Assistance Systems (ADAS) are crucial technologies in the automotive industry, designed to support drivers and enhance safety. As the market evolves, Original Equipment Manufacturers (OEMs) are increasingly focusing on the development of sophisticated ADAS applications. These systems aim to mitigate accidents resulting from human error, such as drowsiness and distraction, by providing alerts for potential hazards, assessing driving performance, and offering helpful suggestions. Consequently, ADAS has become a vital area of research within intelligent automotive technology.

Automotive manufacturers are advancing towards fully autonomous vehicles through the innovation of Advanced Driver Assistance Systems (ADAS), which address human error, responsible for 94% of automotive accidents. Key human factors include speeding, inattentiveness, and improper lookout. ADAS features such as forward collision warning, brake assistance, adaptive cruise control, traffic sign recognition, lane keeping, and lane departure warning are crucial for accident prevention. However, for ADAS to be effective, drivers must find it beneficial, which depends on the system's ease of use and the relevance of the information provided. A critical yet underexplored aspect is the interface design between the vehicle and the driver, which plays a significant role in personalizing the ADAS experience.

ADAS applications typically utilize either a mono front camera or a stereo-vision camera, often enhanced by data from additional sensors like LIDAR or RADAR. These cameras are generally installed on the front windshield, just behind the central rear-view mirror, with their field of view strategically positioned in the wiper area to minimize obstructions. In some cases, RADAR sensing, vision sensing, and data fusion are integrated into a single module for improved functionality.

This thesis presents an Advanced Driver Assistance System (ADAS) that utilizes road information captured by a mono color camera and the Nvidia Jetson AGX Orin vehicle computer for high-performance computing. The intelligent processing of the ADAS is powered by deep learning models, including YOLOv6 and Ultra Fast Lane Detection, which analyze critical observations of surrounding objects. The vehicle computer interprets data related to traffic signs, road lanes, vehicles, and pedestrians to enable essential ADAS functionalities such as Forward Collision Warning (FCW), Traffic Sign Recognition (TSR), and Lane Departure Warning (LDW), subsequently displaying assistance information and alerts on a graphical user interface (GUI).

Figure 1.1: Diagram of the proposed system.

GOALS

This project introduces a centralized E/E hardware architecture for ADAS software that functions in real time, enhancing existing automotive applications. Key features include forward collision warning, lane departure warning, traffic sign recognition, and an intuitive graphic user interface, all designed to improve user experience and safety in future vehicles.

LIMITATIONS

This project focuses on developing models for Advanced Driver Assistance Systems (ADAS) using internet-sourced videos, acknowledging the limitations of ADAS in real-world scenarios such as crowded environments, unexpected changes, low light, and complex traffic situations. For instance, traffic sign recognition is limited to identifying individual signs, unable to process mixed signals. Additionally, the effectiveness of the ADAS software is contingent on the camera angles of the selected videos, as random dashcam footage may not yield reliable results. Furthermore, while the safe distance for forward collision warnings is based on sound research, it remains an assumption when applied to videos downloaded from the internet.

OUTLINES

The main aspects of this thesis are summarized as follows:

• Chapter 1: Introduction

This chapter introduces the topic, the objectives, the limitations, the related works of the research, and the layout of this thesis.

• Chapter 2: Literature Review

This chapter gives the fundamental theory, the framework, and the algorithms for the implementation of the thesis, using the relevant studies as a reference source.

• Chapter 3: System Design

This chapter presents a detailed design of the proposed software and hardware, including data collection, algorithms, procedures, evaluation, and operation.

• Chapter 4: Results

This chapter presents the results of the proposed work.

• Chapter 5: Conclusion and Future Work

This chapter gives the conclusion and some future work to be conducted.

LITERATURE REVIEW

DEEP LEARNING

Deep Learning has become a powerful approach for analyzing large datasets, leveraging complex algorithms and artificial neural networks to enable machines to learn from experience and recognize data in a manner similar to the human brain. Central to this technology is the Convolutional Neural Network (CNN), a type of artificial neural network widely employed in Deep Learning for tasks like image and object detection and classification. CNNs play a crucial role in diverse applications, including image processing, computer vision tasks such as object detection and segmentation, video analysis, obstacle recognition in self-driving cars, and speech recognition in natural language processing.

A Convolutional Neural Network (CNN) consists of three primary layers: the convolutional layer, the pooling layer, and the fully connected layer. Unlike fully connected networks, where each neuron connects to all neurons in the subsequent layer, CNNs feature a more streamlined architecture, as each neuron responds to a specific receptive field. This reduced connectivity leads to lower computational costs. The structure of a CNN effectively captures spatial features from images, with complexity increasing at each layer, enabling accurate object identification by distinguishing one object from another.

Figure 2.1: A basic CNN architecture for image classification

The convolutional layer (Conv) serves as the core component of Convolutional Neural Networks (CNNs), responsible for extracting features from input images. This layer utilizes learnable filters, which are small in width and height but match the depth of the input image. By performing convolution, it maintains the spatial relationships between pixels while learning essential image features. The process involves a mathematical operation that takes two inputs: the input image matrix and the filter or kernel.

Figure 2.2: Demonstration of convolution operation with 5×5 input and 3×3 kernel

Stride and padding are essential concepts in convolutional layers. Stride refers to the number of pixels the filter moves across the input matrix; for a stride of 1, the filter shifts one pixel at a time, while a stride of 2 means it moves two pixels at a time. This process is illustrated in Figure 2.3, which shows how convolution operates with a stride of 1.

Figure 2.3: Demonstration of the convolution operation with a 3×3 kernel over a 5×5 image, padding 0, stride = 1 [4]

Padding is crucial in image processing as it provides extra space for the kernel to effectively cover the input image, ensuring accurate matching despite potential discrepancies. By adding padding to the image frame, we can utilize convolutional layers without diminishing the output image size, which is vital for constructing deeper neural networks.

Figure 2.4: Demonstration of the convolution operation with a 3×3 kernel over a 5×5 image, padding 1, stride = 1 [4]
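For concreteness, the sliding-window arithmetic above can be sketched in a few lines of NumPy. This is an illustrative example, not code from the thesis; the output size follows the standard relation (W − K + 2P)/S + 1 implied by Figures 2.3 and 2.4.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    # Cross-correlation, the operation CNN "convolution" layers actually compute.
    if padding:
        image = np.pad(image, padding)
    k = kernel.shape[0]
    out_size = (image.shape[0] - k) // stride + 1   # (W - K + 2P) / S + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            window = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(window * kernel)
    return out

img, ker = np.random.rand(5, 5), np.random.rand(3, 3)
print(conv2d(img, ker).shape)               # (3, 3): padding 0, stride 1, as in Figure 2.3
print(conv2d(img, ker, padding=1).shape)    # (5, 5): padding 1 preserves size, as in Figure 2.4
```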

The convolutional output is enhanced by adding a bias and passing it through an activation function, typically the ReLU (Rectified Linear Unit), before it is fed to the next layer of the network. The ReLU function, which passes positive inputs through unchanged and outputs zero otherwise, is favored in CNNs because it accelerates deep neural network training compared to traditional activation functions like sigmoid and tanh. Because its gradient does not saturate for positive inputs, earlier layers still receive a useful error signal and can adjust their weights accordingly. In contrast, the sigmoid function is constrained between 0 and 1 and saturates easily, so only small errors reach the first hidden layers, potentially resulting in an inadequately trained neural network. The combination of convolution, bias, and activation is what is commonly referred to as the Conv layer.

In CNN architectures, incorporating pooling layers is essential for reducing the number of parameters when dealing with large images. Spatial pooling, also referred to as sub-sampling or down-sampling, effectively decreases the dimensionality of feature maps while preserving important information. Various forms of spatial pooling exist, such as max pooling, which selects the most significant element from the feature map; average pooling, which computes the average of all elements; and sum pooling, which calculates the total of all elements within the feature map.

Figure 2.6: Demonstration of max pooling with a 3×3 filter on a 5×5 image, stride = 1 [4]
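A small NumPy sketch (illustrative only, not from the thesis) of the max-pooling operation shown in Figure 2.6:

```python
import numpy as np

def max_pool2d(x, k=3, stride=1):
    # Slide a k x k window over x and keep the largest value inside each window.
    h, w = x.shape
    out_h, out_w = (h - k) // stride + 1, (w - k) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + k, j * stride:j * stride + k].max()
    return out

feature_map = np.arange(25).reshape(5, 5)
print(max_pool2d(feature_map).shape)   # (3, 3): a 5x5 map with a 3x3 filter, stride 1
```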

A standard Convolutional Neural Network (CNN) architecture consists of several layers of convolution followed by ReLU activation and pooling, which are repeated until the image is reduced to a small spatial size. The final component of this architecture is a fully connected layer that generates the output after processing through the preceding layers, making it a crucial part of the CNN structure.

After processing through multiple convolutional, ReLU, and pooling layers, the network has identified a variety of unique features within the image. The output tensor from the final layer, with height, width, and channel dimensions (𝐻 × 𝑊 × 𝐶), is then flattened into a one-dimensional vector. This flattened vector is subsequently fed into a fully connected layer for further analysis.

Figure 2.7: Demonstration of the flattening process

The fully connected layer in an artificial neural network takes the feature map from the Conv-ReLU-pooling layers and processes it through several hidden layers with nodes, ultimately classifying the output using the softmax activation function. This design can be enhanced by incorporating multiple fully connected layers for improved performance.
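Putting Sections 2.1.2 to 2.1.4 together, a minimal PyTorch sketch of the Conv-ReLU-Pool, flatten, and fully connected pipeline might look as follows. The layer sizes and the 10-class output are illustrative assumptions, not the networks used later in this thesis.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
    nn.ReLU(),                                    # activation
    nn.MaxPool2d(2),                              # pooling halves H and W
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                 # H x W x C feature map -> 1-D vector
    nn.Linear(32 * 8 * 8, 10),                    # fully connected classifier (10 classes)
)

logits = model(torch.randn(1, 3, 32, 32))         # one 32x32 RGB image
probs = torch.softmax(logits, dim=1)              # softmax turns scores into probabilities
print(probs.shape)                                # torch.Size([1, 10])
```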

OBJECT DETECTION

Deep learning has gained significant attention over the past few decades, leading to extensive research aimed at enhancing its methodologies. Recent advancements have demonstrated remarkable outcomes, particularly in practical applications that impact daily life. A key development is the deep learning-based object detector, which addresses challenges in various fields such as medical image analysis, autonomous vehicles, business analytics, and facial recognition. Object detectors are categorized into one-stage models like the YOLO series and two-stage models, including R-CNN and Faster R-CNN.

Two-stage detectors enhance object detection by first generating proposals that identify potential object regions in an image, aiming for high recall to ensure all items are captured. In the second stage, a deep learning model classifies these proposals, labeling them as either background or objects from predefined classes, while also refining the initial localization provided by the proposal generator.

Figure 2.8: Overview of different two-stage detection frameworks for generic object detection

R-CNN is a ground-breaking deep learning-based two-stage object detector proposed in 2014. The R-CNN pipeline is separated into three parts: (1) proposal generation, (2) feature extraction, and (3) region classification. Around 2,000 proposals per image are generated through Selective Search, effectively filtering out background regions. Each proposal is then processed individually by a CNN for classification and bounding box regression, leading to redundant calculations and significantly increasing the time required for both training and testing.

The Faster R-CNN architecture improves efficiency by sharing features across both stages, utilizing a convolutional backbone like VGG or ResNet to generate global feature maps. This shared approach reduces the cost of generating proposals for the Region Proposal Network (RPN) and the detection network. Subsequent advancements have focused on enhancing detection accuracy through improved backbones that provide richer representations. Notably, feature pyramid networks (FPN) have been introduced to extract Region of Interest (RoI) features from various layers based on scale. Further research, including ResNeXt with grouped convolutions and Res2Net, has aimed to optimize the internal connections of residual networks to better leverage the multi-scale features present in convolutional maps.

One-stage object detectors consist of three main components: a backbone, a neck, and a head. The backbone is crucial for feature representation and significantly affects inference efficiency due to its high computational cost. The neck integrates low-level physical data with high-level semantic features to create pyramid feature maps across various levels. Finally, the head uses multiple convolutional layers to predict detection outcomes based on the multi-level characteristics provided by the neck. Notably, the Single Shot MultiBox Detector (SSD) and YOLO (You Only Look Once) pioneered this unified design, eliminating the need for pre-proposal computations.

Figure 2.9: Overview of different one-stage detection frameworks for generic object detection: a) YOLO and b) SSD [11]

YOLO revolutionized object detection by treating it as a regression problem, dividing the image into a predefined grid (e.g., 7 × 7), where each cell served as a proposal for object detection. Initially, each cell was designed to predict the presence of up to two objects, providing information on item presence, bounding box coordinates, and class identification. However, YOLO faced challenges, including its limitation to detecting only two objects per cell, which hindered its ability to identify smaller or overlapping items, and its reliance on a single feature map for predictions, which proved inadequate for recognizing objects of varying sizes and aspect ratios.

The Single-Shot Multi-box Detector (SSD) improves upon the limitations of YOLO by utilizing a two-part architecture that includes a backbone model and an SSD head. The backbone model, typically a pre-trained image classification network like ResNet or MobileNet, serves as a feature extractor with the last fully connected layer removed. The SSD head consists of additional convolutional layers that process the outputs to generate bounding boxes and classify objects based on the spatial locations of the final layer activations.

The SSD framework utilizes multiple feature maps to predict objects of varying sizes, with each map tailored to specific object dimensions based on its receptive fields. To enhance the detection of larger objects, additional convolutional feature maps were integrated into the original backbone design. The network employs an end-to-end training strategy, optimizing both localization and classification losses through a weighted sum across all prediction maps. Final predictions are generated by aggregating detection scores from these feature maps. To mitigate the influence of negative proposals on training gradients, hard negative mining is implemented, alongside extensive data augmentation to improve detection accuracy. As a result, SSD achieves detection performance comparable to Faster R-CNN while enabling real-time inference capabilities.

The updated versions of YOLO have significantly enhanced performance while preserving real-time inference speed. YOLOv2 improved anchor priors using k-means clustering, which reduced localization optimization challenges, and achieved competitive results by incorporating Batch Normalization layers and multi-scale training. YOLOv3 introduced a new feature extractor, Darknet-53, which features 53 convolutional layers and outperforms previous models like ResNet-101 and ResNet-152 while maintaining faster speeds. YOLOv4 further refined the detection framework into three components (backbone, neck, and head) and implemented strategies like bag-of-freebies and bag-of-specials for efficient training on a single GPU.

A single CNN forward pass is required to generate the output of a one-stage object detector. In the case of two-stage object detectors, the high-score region proposals obtained in the first stage must additionally be processed by the second stage. The inference times of the two detector families can therefore be described by the equations \( T_{\text{one}} = T_{\text{1st}} \) and \( T_{\text{two}} = T_{\text{1st}} + m \cdot T_{\text{2nd}} \), where \( m \) represents the number of region proposals with confidence scores exceeding a specified threshold. This indicates that one-stage object detectors have a fixed inference time, whereas two-stage detectors do not. Consequently, for real-time applications, one-stage object detectors are essential.

Recent advancements in real-time object detection have focused on developing efficient architectures, primarily utilizing MobileNet, ShuffleNet, and GhostNet for CPU implementations, while ResNet and DarkNet are favored for GPU applications. Currently, the leading real-time object detectors are based on the YOLO framework, specifically YOLOv5, YOLOv6, and YOLOv7. In this context, the Advanced Driver Assistance System (ADAS) applications have been developed using YOLOv6, following a thorough evaluation of each architecture's performance, which will be detailed in Chapter 3.

YOLOv6 OBJECT DETECTION ARCHITECTURE

The YOLO series has emerged as the leading detection architecture in industrial applications, striking an optimal balance between speed and accuracy. The development of YOLOv6 is driven by key observations aimed at enhancing the YOLO framework, including the recognition that RepVGG re-parameterization is an undervalued strategy in object detection, and that simple model scaling for RepVGG blocks is impractical. The authors argue that a consistent network design is not essential for small and large networks, favoring a plain single-path architecture for smaller models, while acknowledging the impracticality of this approach for larger models due to exponential parameter growth and computational costs. They emphasize the need for careful quantization of reparameterization-based detectors to avoid performance degradation during training and inference. Additionally, previous research has overlooked deployment considerations, often comparing latencies on high-cost hardware like the Nvidia V100, creating a gap for practical applications on more accessible GPUs and edge devices such as the Tesla T4 and Jetson systems. The authors also highlight the importance of verifying advanced domain-specific strategies like label assignment and loss function design, and they advocate for training strategy adjustments that enhance accuracy without increasing inference costs, such as knowledge distillation. To support various model sizes, they present two scaled reparameterizable backbones and necks, along with an efficient decoupled head utilizing a hybrid-channel technique, all designed with hardware-friendly principles in mind.

Figure 2.10: The overall YOLOv6 architecture. RepBlock comprises a stack of RepVGG blocks with ReLU activations [8]

When developing effective backbone designs, key factors include the number of parameters, computation requirements, and computational density. While advanced CNNs often achieve higher accuracy than basic models, they come with significant drawbacks. Complex architectures, like ResNet's residual addition and Inception's branch concatenation, complicate implementation, slow down inference, and hinder memory efficiency. Additionally, components such as depthwise convolutions in Xception and MobileNets, as well as channel shuffling in ShuffleNets, can raise memory access costs and may lack adequate device support. Consequently, the number of floating-point operations (FLOPs) does not reliably indicate actual inference speed; for instance, simple models like VGG and ResNet-18/34/50 do not necessarily run slower than more complicated networks with fewer FLOPs.

As a result, VGG and the original versions of ResNets are still often used in academia and industry for real-world applications [13]

Figure 2.11: Presentation of RepVGG architecture versus ResNet and InceptionV2

RepVGG features a straightforward architecture similar to classic VGG, comprising a series of convolutional layers, ReLU activations, and pooling without branches. This simplicity makes it difficult for plain models to achieve performance levels comparable to multi-branch architectures. The vanishing gradient problem complicates the training of deep networks, as gradients diminish during backpropagation, leading to performance degradation in deeper networks. To address this, ResNet employs a strategy that effectively creates an implicit ensemble of shallower models, using 1×1 convolution layers for linear projection of feature maps, thus mitigating the vanishing gradient issue. In contrast, the Inception block utilizes four parallel branches with varying convolutional window sizes (1×1, 3×3, and 5×5) to capture information across different spatial scales, optimizing model complexity by incorporating 1×1 convolutions to reduce channel counts in the middle branches.

Multi-branch architectures outperform plain models during training, but their drawbacks become apparent during inference. To address this issue, RepVGG introduces a method that decouples the training-time multi-branch architecture from the inference-time plain architecture through structural re-parameterization, effectively transforming the model's parameters for optimal performance in both phases.

RepVGG employs identity and 1×1 branches during training, drawing inspiration from ResNet, but allows the branches to be removed through structural re-parameterization. This process, illustrated in Figure 2.11 (d), involves simple algebra after training: the identity branch can be regarded as a degraded 1×1 convolution, and the 1×1 branch as a degraded 3×3 convolution. Consequently, RepVGG can construct a single 3×3 kernel from the trained parameters of the original 3×3 kernel, the identity and 1×1 branches, and the batch normalization layers. Ultimately, the converted model contains only 3×3 convolutional layers for testing and deployment.
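The structural re-parameterization described above can be sketched directly: fold each branch's batch normalization into an equivalent convolution, express the 1×1 and identity branches as 3×3 kernels, and sum them. This is a simplified illustration (single group, equal input and output channels), not the official RepVGG implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(weight, bn):
    # Fold BatchNorm statistics into the preceding bias-free convolution.
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std
    return weight * scale.reshape(-1, 1, 1, 1), bn.bias - bn.running_mean * scale

def reparameterize(conv3, bn3, conv1, bn1, bn_id, channels):
    w3, b3 = fuse_conv_bn(conv3.weight, bn3)
    w1, b1 = fuse_conv_bn(F.pad(conv1.weight, [1, 1, 1, 1]), bn1)  # 1x1 kernel zero-padded to 3x3
    ident = torch.zeros(channels, channels, 3, 3)
    for c in range(channels):
        ident[c, c, 1, 1] = 1.0                    # identity as a degraded 3x3 convolution
    wi, bi = fuse_conv_bn(ident, bn_id)
    return w3 + w1 + wi, b3 + b1 + bi              # a single equivalent 3x3 convolution

# Sanity check: the fused kernel reproduces the three-branch output (in eval mode).
ch = 8
conv3 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
conv1 = nn.Conv2d(ch, ch, 1, bias=False)
bn3, bn1, bn_id = nn.BatchNorm2d(ch).eval(), nn.BatchNorm2d(ch).eval(), nn.BatchNorm2d(ch).eval()
x = torch.randn(1, ch, 16, 16)
with torch.no_grad():
    branched = bn3(conv3(x)) + bn1(conv1(x)) + bn_id(x)
    w, b = reparameterize(conv3, bn3, conv1, bn1, bn_id, ch)
    fused = F.conv2d(x, w, b, padding=1)
print(torch.allclose(branched, fused, atol=1e-5))  # True
```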

The inference-time RepVGG architecture utilizes a single operator type: the 3×3 convolution followed by ReLU activation. This design choice enhances the model's speed on general computing devices like GPUs. Moreover, RepVGG's streamlined operator requirement allows specialized hardware to achieve even greater performance, as it enables the integration of more computing units within the constraints of chip size and power consumption. Consequently, an inference device optimized for RepVGG can accommodate a substantial number of 3×3-ReLU units while minimizing memory units.

Figure 2.12: Peak memory occupation in the residual and plain model [13]

The multi-branch topology in neural networks is memory-inefficient due to the need to retain results until addition or concatenation, leading to increased peak memory usage. In a residual block, the input must be preserved until the addition is performed, which can hinder memory efficiency. In a plain topology, by contrast, the memory occupied by an operation's inputs can be released as soon as the operation finishes, optimizing resource usage. Designing specialized hardware for simple CNNs thus enables significant memory optimizations and reduces memory unit costs, facilitating the integration of additional computing units onto the chip.

Pyramid network algorithms play a crucial role in enhancing object detection by effectively fusing features from various levels of the backbone model. Two prominent types of pyramid networks are Feature Pyramid Networks (FPN), utilized in YOLOv3, and Path Aggregation Networks (PAN), both of which contribute to improved detection performance.

YOLOv6 utilizes the PANet architecture, specifically RepPAN, for its neck design, incorporating RepVGG blocks for smaller models and CSPStackRep blocks for larger ones. This PANet topology enhances object localization through path augmentation, which refines low-level patterns. Additionally, the structure employs concatenation and fusion techniques to accurately predict object classes and masks.

Figure 2.13: YOLOv6 Rep-PAN neck architecture [8]

As an image progresses through the layers of a backbone network, the complexity of its features escalates, transitioning from basic elements like edges and textures to more intricate components such as eyes and noses. However, the use of varying stride convolutions and pooling layers leads to a decrease in the spatial resolution of the feature maps, resulting in a loss of spatial information. Consequently, these high-level features become inadequate for accurately predicting pixel-level masks.

The PAN network operates in stages, each consisting of layers that produce feature maps with identical spatial dimensions. For instance, in Figure 2.13, P3 and C3 represent feature maps from the same stage, where 'C' denotes feature maps from the backbone and 'P' indicates final feature maps. Each stage in the bottom-up path generates a feature map through a RepBlock applied to the previous feature map, followed by concatenation with the corresponding stage feature map from the backbone. A final 3×3 convolution is then executed on this combined output to yield the final features.
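A much-simplified PyTorch sketch of one such fusion stage, assuming the two inputs already share the same spatial resolution (the up/down-sampling between stages and the real RepBlock internals are omitted):

```python
import torch
import torch.nn as nn

class FusionStage(nn.Module):
    def __init__(self, prev_ch, backbone_ch, out_ch):
        super().__init__()
        self.rep_block = nn.Sequential(              # stand-in for a RepBlock
            nn.Conv2d(prev_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(out_ch + backbone_ch, out_ch, 3, padding=1)  # final 3x3 conv

    def forward(self, prev_feat, backbone_feat):
        x = self.rep_block(prev_feat)                # process the previous feature map
        x = torch.cat([x, backbone_feat], dim=1)     # concatenate with the backbone stage (C)
        return self.fuse(x)                          # yield the final features (P)

stage = FusionStage(prev_ch=64, backbone_ch=128, out_ch=128)
p = stage(torch.randn(1, 64, 40, 40), torch.randn(1, 128, 40, 40))
print(p.shape)   # torch.Size([1, 128, 40, 40])
```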

The object detection head is crucial for predicting bounding boxes and class scores in end-to-end object detection systems, which often utilize multiple heads to accurately identify objects at varying resolutions. In the YOLO family, there are three heads dedicated to classification, localization, and regression. Notably, while classification and localization in the coupled head share similar characteristics, they represent distinct tasks. Research has highlighted the detrimental impact of spatial misalignment between these two functions, indicating that such conflicts can significantly hinder the training process and the effectiveness of both classification and regression tasks.

Figure 2.14: YOLOv6 decoupled head architecture [8]

As shown in Figure 2.14, the efficient decoupled head differs from the detection head of YOLOv5, in which parameters are shared between the classification and localization branches.

In YOLOv6, the implementation of a hybrid-channel method inspired by YOLOX leads to a more efficient decoupled head design. By limiting the middle 3×3 convolutional layers to just one, and jointly scaling the head's width through the width multiplier used for the backbone and neck, the model achieves significant reductions in processing costs, ultimately resulting in faster inference times.
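A minimal sketch of such a decoupled head with a single middle 3×3 convolution per branch (the channel counts and the single prediction per location are illustrative assumptions, not the exact YOLOv6 head):

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, in_ch, 1)                       # shared 1x1 stem
        self.cls_branch = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.reg_branch = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.cls_pred = nn.Conv2d(in_ch, num_classes, 1)             # class scores
        self.box_pred = nn.Conv2d(in_ch, 4, 1)                       # box regression
        self.obj_pred = nn.Conv2d(in_ch, 1, 1)                       # objectness

    def forward(self, feat):
        x = self.stem(feat)
        c, r = self.cls_branch(x), self.reg_branch(x)
        return self.cls_pred(c), self.box_pred(r), self.obj_pred(r)

head = DecoupledHead(in_ch=128, num_classes=12)
cls, box, obj = head(torch.randn(1, 128, 40, 40))
print(cls.shape, box.shape, obj.shape)
```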

SYSTEM DESIGN

OVERALL SYSTEM

Many car manufacturers are increasingly incorporating camera systems into their vehicles, particularly in high-end models, utilizing features like rear-view cameras to prevent accidents, front cameras for lane departure warnings, and stereo cameras for depth estimation. While advanced applications require more sophisticated and costly processing units and sensors, evolving safety regulations and competitive market dynamics are driving the adoption of versatile and affordable camera technology in lower-end cars, making these features a standard in the automotive industry.

Figure 3.1: Warning zone of the sensors [16]

This thesis presents an Advanced Driver Assistance System (ADAS) that utilizes a front camera and the Jetson Orin vehicle computer to implement three key applications: forward collision warning, lane departure warning, and traffic sign recognition. These functionalities are recognized as vital components of driver assistance technologies by the US National Highway Traffic Safety Administration (NHTSA).

Figure 3.2: The framework of the proposed ADAS

The camera frame is processed by the Jetson Orin, which handles the primary tasks efficiently. To optimize performance and prevent the application from freezing, four threads operate concurrently, managing three resource-intensive tasks: traffic sign recognition, forward collision warning, and lane detection. This multi-threading approach ensures smooth operation of the GUI application while effectively processing heavy object detection tasks.

The traffic sign recognition, forward collision warning, and lane departure warning blocks execute asynchronously in their own threads to process data efficiently. Once completed, the results are sent to the final block for visualization and analysis, enabling the preparation of assistance information. This critical data is then presented on the vehicle's graphical user interface (GUI). Further details on these applications will be provided in subsequent sections.

Figure 3.3: Apply concurrent programming for the system.
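A hypothetical Python sketch of this four-thread layout (the run_* callables are placeholders for the real TSR, FCW, and lane-detection inference; the thesis implementation itself is not reproduced here):

```python
import threading
import queue

run_tsr = run_fcw = run_ldw = lambda frame: f"processed {frame}"   # placeholder inference

frame_queues = {name: queue.Queue(maxsize=1) for name in ("tsr", "fcw", "ldw")}
result_queue = queue.Queue()

def worker(name, infer):
    # Each heavy task runs in its own thread so the GUI thread never blocks on inference.
    while True:
        frame = frame_queues[name].get()
        result_queue.put((name, infer(frame)))     # hand results to the final block

for name, infer in (("tsr", run_tsr), ("fcw", run_fcw), ("ldw", run_ldw)):
    threading.Thread(target=worker, args=(name, infer), daemon=True).start()

# Main (GUI) thread: dispatch the latest camera frame, then drain results for visualization.
for q in frame_queues.values():
    q.put("frame_0")
for _ in range(3):
    print(result_queue.get())
```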

COMPARISON OF OBJECT DETECTION MODELS

This study focuses on developing a real-time ADAS application for low-power edge devices in vehicles, emphasizing the evaluation of object detection models for optimal accuracy and inference speed. It benchmarks the latest YOLO models, including YOLOv5, YOLOv6, and YOLOv7, using a custom dataset derived from TT100K. All models are converted to FP16 precision with Nvidia TensorRT to enhance speed on the Jetson Orin vehicle computer, as TensorRT significantly improves performance by fusing ReLU with convolution. Additionally, the study compares FP32 and FP16 precision to analyze the trade-off between accuracy and speed.

The dataset utilized in this section is intended solely for training benchmark models, not the final models implemented in the proposed Advanced Driver Assistance System (ADAS) software, as the author identifies specific methods that yield superior results, which will be elaborated on in Sections 3.3 and 3.4. This benchmark dataset consists of 16 classes, featuring 5,469 training images and 916 validation images, representing 85.6% and 14.4% of the total dataset, respectively. Furthermore, as shown in Table 3.1, the dataset includes 9,248 traffic sign labels from the TT100K dataset and an additional 13,891 labels from this research, covering three categories: four-wheelers (cars, trucks, buses), two-wheelers (motorcycles, bicycles), and pedestrians.

Table 3.1: Classes and labels in Benchmark Dataset

Number Classes Train labels Validation labels

Each YOLO model was trained on an RTX 3090 for 150 epochs, with an input size of 1280×1280, a batch size of 12, a learning rate of 0.01, a momentum of 0.937, and slight data augmentation, as shown in Table 3.2:

Table 3.2: Data augmentation for training benchmark model

Mix-up 0.15 Mix-up (probability)

The benchmark results, detailed in Table 3.3 and Figure 3.5, indicate that the YOLOv6s model outperforms YOLOv5s in mAP 50 (FP16) by 18.05%, despite being 11.41% slower. Additionally, YOLOv6s surpasses YOLOv5m in both accuracy and inference speed, achieving improvements of 0.13% and 75.49%, respectively. While YOLOv7 delivers the highest precision, it is also the slowest among the models tested. Notably, there is no difference in precision between the FP16 and FP32 results, while converting to FP16 roughly halves the FP32 inference time. Ultimately, the YOLOv6s model in FP16 precision was selected for this ADAS software due to its optimal balance of accuracy and speed.

Table 3.3: Comparison of YOLO-series object detectors on the benchmark dataset

Models Params FLOPs Input size mAP 50 val, FP16 mAP 50 val, FP32

Figure 3.5: The detection result of YOLOv6s in FP16 precision on benchmark dataset.

TRAFFIC SIGN RECOGNITION

Traditional Traffic Sign Recognition (TSR) methods are categorized into color-based, shape-based, and machine learning-based techniques. Color and shape detection primarily rely on specific image colors and shapes to manually extract features, such as SIFT (scale-invariant feature transform) and Histograms of Oriented Gradients (HOG) features, often matching them through templates. However, these methods are susceptible to environmental factors like weather and lighting. In contrast, machine learning approaches aim to extract invariant visual elements from traffic signs, enabling the recognition and classification of these signs to understand their semantic information. A significant challenge for current traffic sign recognition systems is that traffic signs occupy a small portion of the overall road scene, complicating accurate detection.

Figure 3.6: TT100K Traffic Sign Dataset

The Tsinghua-Tencent 100K dataset (TT100K) illustrates that traffic signs occupy approximately 0.2% of the total image pixels, highlighting the challenge of detecting both large and small-scale signs within the same frame. This scale variation can lead to inaccuracies in detection, resulting in false positives or missed detections. To achieve effective real-time detection in complex traffic scenarios, traffic sign detection technology must ensure high accuracy and rapid inference speeds.

Figure 3.7: Small traffic signs in the TT100K dataset’s image

After surveying the YOLO-series architectures, the above problems can be solved by the YOLOv6 architecture [8] together with the "bag-of-freebies" training strategies inspired by YOLOv4 [6].

YOLOv6 addresses the challenge of preserving spatial information for small-scale traffic signs by incorporating a RepPAN neck, which ensures that this information is effectively transmitted to deeper network layers. Utilizing the RepVGG backbone enhances the feature extraction capabilities of the traffic sign detector, allowing it to go deeper without encountering gradient vanishing issues. This approach not only maintains the spatial integrity of small-scale traffic signs but also minimizes computational costs during inference.

This study focuses on pre-processing the TT100K dataset to tailor it to Vietnam's traffic signs, with a particular emphasis on prohibitory signs. Speed limit signs indicating speeds less than 50 km/h have been excluded from the dataset, as they have not been in use in Vietnam since 2015. The resulting custom TSR-TT100K dataset, detailed in Table 3.4, comprises 12 classes, featuring 5,160 training images and 620 validation images, which represent 89.3% and 10.7% of the dataset, respectively.

Table 3.4: Classes and labels in the TSR-TT100K dataset

Number Classes Train labels Validation labels

This study enhances model performance by employing the "bag-of-freebies" strategy during training. The YOLOv6s model for Traffic Sign Recognition is first trained from scratch on an RTX 3090 for 200 epochs, with an input size of 1280×1280, a specified batch size, a learning rate of 0.01, a momentum of 0.937, and minor data augmentation. For transfer learning, the model is then fine-tuned from the trained weights for 300 epochs, applying a learning rate of 0.0032 and heavy data augmentation, as detailed in Table 3.5 below.

Table 3.5: Data augmentation for training from scratch and fine-tuning

Parameters Scratch Values Finetune Values Descriptions

Mix-up 0.15 0.243 Mix-up (probability)

The author discovered that using the default input size of 640×640 from the COCO2017 dataset was ineffective, as the loss value failed to converge after 50 epochs due to the small size of traffic signs. To address this issue, the input size was increased to 1280×1280, allowing for better resolution and enhanced feature extraction from the tiny labeled bounding boxes in the dataset, ultimately improving the performance of the YOLOv6 architecture.

Training the YOLOv6s-TSR model at a resolution of 1280×1280 for 300 epochs requires approximately 9 hours, which is nearly three times longer than training at 640×640. The training performance, illustrated in Figures 3.8 and 3.9, showcases the model's effectiveness through two key loss metrics: classification loss and IoU loss. IoU loss assesses the likelihood of an object being present in a specified region, while classification loss evaluates the accuracy of the algorithm's class predictions. Notably, after 20 epochs, there is a significant decline in loss values, indicating improved model performance. However, due to the limited dataset, the scratch model experiences rapid precision gains before leveling off around the 50-epoch mark.

Figure 3.8: YOLOv6s-TSR performance metrics during training from scratch

The YOLOv6 creators propose an early stopping strategy to select optimal weights for fine-tuning, leveraging transfer learning to enhance model performance on new data without the need for complete network retraining. This approach yields an impressive initial mAP 50 score of 0.954 during fine-tuning, with a further 1% accuracy improvement noted after training. However, more significant gains are observed in the evaluation metrics, as the training results are based on the average of just 20 images rather than the entire dataset. Additionally, the confusion matrix generated at a resolution of 640×640 indicates that deep blue cells signify effective class differentiation by the fine-tuned weights, as illustrated in Figure 3.10.

Figure 3.9: YOLOv6s-TSR performance metrics during fine-tuning training

Figure 3.10: Confusion matrix from evaluating fine-tuning TSR weights

Recording images for every real-world scenario is challenging, but leveraging transfer learning and robust data augmentation allows models to learn from diverse situations. The inference results from a random road in Vietnam demonstrate that the fine-tuned model outperforms the scratch model, accurately detecting traffic signs with high precision. In contrast, the scratch model shows lower accuracy, misidentifying the 80 km/h speed limit sign as a 100 km/h sign.

Figure 3.11: Comparison inference result of scratch and fine-tuned TSR weights

Table 3.6 presents the mAP 50 evaluation results at a resolution of 640×640 after training for 200 epochs from scratch and 300 epochs of fine-tuning. Fine-tuning improves the average accuracy across all classes by 4.76% compared to the scratch weights. While the fine-tuned weights achieve a further 9.48% higher accuracy at an inference size of 1280×1280, this comes with a significantly slower average inference time of 8.51 ms compared to 2.41 ms for the scratch weights. Additionally, the fine-tuned weights enhance accuracy by 16.4% compared to the benchmark weights, whose dataset includes most classes from the custom TT100K dataset, excluding Limit Speed 70. This difference is attributed to the benchmark dataset's numerous manually labeled instances of vehicles and pedestrians, which led to class imbalance and decreased performance. Consequently, the TSR fine-tuned weights were selected for the YOLOv6s model due to their superior accuracy and efficient inference at 640×640 resolution.

Table 3.6: Evaluation results from scratch, fine-tuning, and the benchmark dataset

Classes mAP 50, Benchmark mAP 50, Scratch mAP 50, Finetune mAP 50, Finetune at 1280

The author enhances inference speed by converting the fine-tuned weights trained on the TT100K dataset to a TensorRT floating-point 16 engine, maintaining accuracy as evidenced in Table 3.4. Traffic signs are typically positioned above and to the right of the driver's field of view, so the TSR block is designed to detect signs in these areas. When the TSR block identifies a sign within the specified region and the detection meets the confidence and IoU thresholds of over 0.8 and 0.25, the sign's coordinates and name are sent to the final processing and visualization block. The detection results for traffic signs in the proposed regions are illustrated in Figure 3.12.

Figure 3.12: Common location of traffic sign on the road

Figure 3.13: TSR results for traffic signs in the above and right regions (images are zoomed in because the bounding boxes, class names, and scores are small)
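A minimal sketch of this region-and-threshold filter. The fractions used to delimit the "above" and "right" regions are assumptions for illustration; in practice they would be calibrated to the camera mounting.

```python
def accept_tsr_detection(box, score, frame_w, frame_h, conf_thres=0.8):
    """Return True if a detected sign passes the confidence check and lies in the
    upper or right part of the frame (hypothetical region fractions)."""
    if score < conf_thres:
        return False
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    in_upper_region = cy < 0.35 * frame_h
    in_right_region = cx > 0.66 * frame_w
    return in_upper_region or in_right_region

print(accept_tsr_detection((1100, 120, 1180, 200), 0.91, frame_w=1280, frame_h=720))  # True
```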

FORWARD COLLISION WARNING

Forward Collision Warning (FCW) is a crucial technology for reducing the risk of collisions with vehicles ahead, utilizing various sensors such as radar, LiDAR, and cameras. While LiDAR and radar are commonly employed for distance measurement and FCW due to their effectiveness in detecting distant obstacles in adverse weather conditions, they often rely on costly hardware and have limited capabilities in identifying objects for advanced applications like traffic sign recognition and lane departure warnings. In contrast, a single vision system offers a more cost-effective solution, eliminating the need for complex sensor integration and simplifying deployment, although it may lead to increased false detections when identifying vehicles using edge information.

Figure 3.14: The visualization of FCW

This study presents a low-cost, efficient solution for object classification using a deep learning-based embedded system, leveraging the YOLOv6s architecture. Designed for low-power edge devices, this system delivers quick and accurate performance, making it suitable for operation within vehicles.

This study introduces a novel road dataset derived from TT100K, moving away from the YOLOv6s weights pre-trained on the COCO2017 dataset, which includes 80 classes. The motivation behind creating this custom dataset lies in the diverse real-life traffic scenarios present in TT100K that remain underutilized for this project. In contrast, the COCO2017 dataset features numerous images unsuitable for traffic contexts, as illustrated in Figure 3.15, leading to inefficiencies in custom training.

Figure 3.15: Comparison between COCO2017 and FCW-TT100K dataset containing at least a person or vehicle

The FCW-TT100K dataset is meticulously labeled for road scenarios, featuring 3,754 training images and 759 validation images, which comprise 83.2% and 16.8% of the dataset, respectively, categorized into three classes: four-wheelers (cars, trucks, buses), two-wheelers (motorcycles, bicycles), and pedestrians. In total, there are 18,832 labels across both training and validation sets, as illustrated in Figure 3.16 and detailed in Table 3.7.

Figure 3.16: Sample labels in the FCW-TT100K dataset

Table 3.7: Classes and labels in the custom FCW-TT100K dataset

Classes Train labels Validation labels

This study employs the same training strategy as the YOLOv6s-TSR model for the FCW model, adjusting the resolution to 640×640 to accommodate the dataset's focus on large objects. Training spans 400 epochs and takes nearly five hours, with Figures 3.17 and 3.18 illustrating the training performance metrics. The loss value decreases significantly after 20 epochs, while the scratch training process shows a steady increase in mAP 50 until the 100th epoch, where early stopping is applied to select weights for fine-tuning. Although the fine-tuning metrics exhibit instability due to extensive data augmentation, the fine-tuned weights yield favorable evaluation and inference results, as shown in Table 3.8 and Figure 3.21.

Figure 3.17: YOLOv6s-FCW performance metrics during training from scratch

Figure 3.18: YOLOv6s-FCW performance metrics during fine-tuning training

The evaluation of the confusion matrix at 640×640 reveals a significant issue with high false-positive background detections in the YOLOv6-FCW model. Despite this, the model excels at distinguishing between the different classes. The high-resolution images of the TT100K dataset, captured at 2048×2048 by an ultra-wide-angle camera, contain numerous objects and details, presenting both an advantage for future enhancements and a challenge due to the extensive effort required for accurate labeling, as illustrated in Figure 3.20. Consequently, the high false-positive background rate may stem from detecting distant vehicles, pedestrians, and difficult-to-label edges, yet this also demonstrates the model's capability to identify these elements effectively.

Figure 3.19: Confusion matrix from evaluating fine-tuning FCW weights

Figure 3.21 illustrates the inference outcomes of the scratch and fine-tuned FCW weights on a random road in Vietnam, highlighting the superior accuracy and object recognition capabilities of the fine-tuned weights. In contrast, the scratch weights exhibit lower accuracy and fail to detect numerous objects. The yellow box indicates an instance where the fine-tuned weights inaccurately outline an object's bounding box, while the red box marks the objects that were not detected.

Figure 3.21: Comparison inference result of scratch and fine-tuned FCW weights

Table 3.8 shows the mAP 50 evaluation at 640×640 resolution. The average accuracy of the fine-tuned weights across all classes improved by 4.76% and 12.47% compared with the scratch weights and the benchmark weights, respectively.

Table 3.8: Evaluation results from scratch, fine-tuning, and the benchmark dataset

Classes mAP 50, Benchmark mAP 50, Scratch mAP 50, Finetune

Currently, there is no tool available for directly comparing detection results across YOLO models for the same class. However, an analysis of randomly selected dashcam footage reveals that the fine-tuned FCW-TT100K weights outperform the pre-trained COCO weights in accuracy, as illustrated in Figures 3.22 and 3.23. While the COCO dataset includes nearly 100,000 labels for cars, its pre-trained weights are generalized across 80 classes, and the images of cars and people are not specifically captured for traffic scenarios. This indicates that the proposed model could achieve even better results with a dataset tailored specifically to traffic conditions.

Figure 3.22: Example 1: detection result from pre-trained COCO weights and FCW-TT100K fine-tuned weights

Figure 3.23: Example 2: detection result from pre-trained COCO weights and FCW-TT100K fine-tuned weights

Establishing appropriate driving distances is essential to prevent forward collisions. Research indicates that the stopping distance (SD) varies across different road conditions (dry, wet, and snowy) and speeds. When a driver perceives a danger, there is a delay known as the thinking time (t1) before they decide to brake. Once the decision is made, the time it takes for the driver to physically move their leg and engage the brake pedal is referred to as the reaction time (t2). Additionally, there is a brief delay before the braking system takes effect, known as the brake effectiveness time (t3), during which the vehicle continues at a constant speed. Once the brakes are engaged, the vehicle decelerates steadily until it stops, a phase known as the braking period (t4).

Figure 3.24: Stopping distance dynamics when the driver perceives a danger [27]

The braking time 𝑡4 is affected by different factors such as the braking force, tire state, tire type, and friction force.

In the stopping-distance formula, V is the vehicle speed, 𝑡1 is the thinking time, g = 9.81 m/s² is the gravitational constant, f is the adhesion coefficient that varies with the road type, and 𝑠 is the slope of the road. The proposed ADAS assumes an ideal environment; hence the highest adhesion coefficient (f = 0.9), corresponding to a dry road, is considered for all scenarios. The research in [27] also provides the typical values of 𝑡1, 𝑡2, and 𝑡3 used by the proposed ADAS vehicle computer, which are 0.5 seconds, 0.2 seconds, and 0.3 seconds, respectively. A standard city speed of 50 km/h is assumed for all scenarios. Based on these parameters, the calculated stopping distance for the vehicle is 25 meters.
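As a worked check of the 25 m figure, the standard stopping-distance relation SD = V(t1 + t2 + t3) + V² / (2g(f ± s)) reproduces the stated value with the parameters above. The exact formula from [27] is not reproduced in this excerpt, so the form below is an assumption consistent with the quoted numbers.

```python
V = 50 / 3.6                   # vehicle speed: 50 km/h in m/s
t1, t2, t3 = 0.5, 0.2, 0.3     # thinking, reaction, brake-effectiveness times [s]
g, f, s = 9.81, 0.9, 0.0       # gravity, dry-road adhesion coefficient, road slope

reaction_distance = V * (t1 + t2 + t3)         # travelled at constant speed before braking
braking_distance = V ** 2 / (2 * g * (f + s))  # covered during the braking period t4
print(round(reaction_distance + braking_distance, 1))   # 24.8 -> rounded up to 25 m
```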

The system features a caution zone marked by a yellow mask and a danger zone indicated by an orange mask, as illustrated in Figure 3.25. Upon detecting vehicles and pedestrians, their coordinates are forwarded to the final processing block for analysis. A safe distance of 25 meters is established, serving as the upper boundary between the danger zone and the vehicle. If any detected object falls within the danger distance, a danger alarm is activated, accompanied by a graphical user interface (GUI) alert. Additionally, if object coordinates are located within a region that extends 5 meters beyond the upper boundary of the danger zone, a warning is displayed through the GUI.

Figure 3.25: Caution and danger zone
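A simplified sketch of the zone test described above (a 1-D version using only the bottom edge of the bounding box; the real zones are trapezoidal image masks, and the pixel rows used for the boundaries below are assumed calibration values):

```python
def collision_state(box_bottom_y, danger_line_y=560, caution_line_y=480):
    # Image y grows downward, so a larger bottom-edge y means the object is closer.
    if box_bottom_y >= danger_line_y:
        return "danger ahead"
    if box_bottom_y >= caution_line_y:
        return "collision warning"
    return "clear"

print(collision_state(600), collision_state(500), collision_state(300))
```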

This project utilizes multiple dashcam footages to address the limitations of fixed car cameras, allowing for manual adjustments of the caution and danger zones based on varying conditions. Proper calibration of the upper boundaries of these zones is essential to ensure that the pixel dimensions align with a stopping distance of 25 meters and a lane width of 3.5 meters. Customization for each vehicle is necessary due to differences in camera angle and position. The demonstration results, depicted in Figure 3.26, illustrate the functionality of the Forward Collision Warning (FCW) system when a detected vehicle's bounding box enters the caution and danger zones.

Figure 3.26: Danger alert rise after a vehicle is in the danger zone.

LANE DEPARTURE WARNING

Lane detection can be approached through standard image processing methods or advanced deep segmentation techniques, with the latter gaining traction due to their superior representation and learning capabilities. Despite their success, significant challenges remain in this field. Lane detection is essential for autonomous driving and requires low computational cost to remain efficient, since current autonomous driving systems often integrate various vision and deep learning applications that all share limited computing resources. To address these needs, the proposed Advanced Driver Assistance System (ADAS) adopts the Ultra Fast Lane Detection (UFLD) method, which utilizes a row-anchor approach and a ResNet-18 backbone to achieve rapid processing speeds while effectively tackling the no-visual-clue problem.

Figure 3.27 illustrates the process of selecting the left and right lanes, with a detailed view of row selection on the right side. The predefined row locations, known as row anchors, serve as the basis for the UFLD formulation, which involves horizontally selecting a gridding cell on each row anchor. Additionally, a background gridding cell is depicted on the right side of the image to signify the absence of a lane in that particular row [30].

Figure 3.29 illustrates the differences between UFLD and traditional segmentation methods. Because the number of predefined row anchors and gridding cells (ℎ × 𝑤) is considerably smaller than the image dimensions (𝐻 × 𝑊), i.e. ℎ ≪ 𝐻 and 𝑤 ≪ 𝑊, the UFLD formulation requires far fewer computations than the original segmentation formulation: instead of classifying every one of the 𝐻 × 𝑊 pixels into (𝐶 + 1) categories, it only solves 𝐶 × ℎ classification problems over (𝑤 + 1) gridding cells. As a result, the computation cost of the UFLD method is lowered significantly. Besides, the UFLD method uses global features as input, which have a larger receptive field than segmentation; context information from other locations of the image can therefore be utilized to address the no-visual-clue problem [30].

Lane detection faces significant challenges due to the no-visual-clue problem, particularly in scenarios with severe occlusion or distortion caused by varying lighting conditions. These factors lead to a lack of visual cues, making it difficult to accurately identify the different lanes, which are typically marked with distinct colors.

Figure 3.29: The UFLD method selects locations (grids) on rows, while segmentation classifies every pixel. The dimensions used for classifying are also marked in red [30]

The LDW block processes lane detection results by assuming the camera is centrally mounted on the windshield. It analyzes the positions of the left and right lanes relative to the center of the image frame, utilizing the known width between the lanes (such as 3.5 meters in Vietnam) to accurately estimate the real-world distances corresponding to pixel measurements.

Figure 3.30: Illustration of lane detected result from UFLD

The central axis x represents the point of view and varies based on the simulation footage or the actual camera position on the vehicle. To calculate the off-center distance, the bottom x-coordinates of the left and right lanes are utilized, as they provide the most stable results from UFLD. A warning flag is triggered if the off-center value exceeds 0.6 m, with the calculation method outlined in Algorithm 1 and illustrated in Figure 3.31.

Algorithm 1: Calculate the off-center value

Input: x-axis coordinates of the left and right main lanes

Output: The off-center value

left ← left main lane bottom x-axis coordinate
right ← right main lane bottom x-axis coordinate
off_center ← distance away from the center x-axis

Figure 3.31: Example of lane detection and off-center value.
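A sketch of Algorithm 1 under the assumptions above (bottom x-coordinates of the detected main lanes, a 3.5 m lane width, and a 0.6 m warning threshold); the variable names and the sample pixel values are illustrative only.

```python
def off_center_m(left_x, right_x, frame_center_x, lane_width_m=3.5):
    meters_per_pixel = lane_width_m / (right_x - left_x)   # scale from the known lane width
    lane_center_x = (left_x + right_x) / 2
    return (frame_center_x - lane_center_x) * meters_per_pixel

value = off_center_m(left_x=420, right_x=860, frame_center_x=700)
print(round(value, 2), abs(value) > 0.6)   # 0.48 False -> no lane departure warning
```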

GRAPHIC USER INTERFACE

The rapid growth of the automotive industry has sparked a revolution in vehicle design, with the graphical user interface (GUI) emerging as a key feature for automakers to personalize and improve the user experience. This study presents a GUI that effectively interacts with the Advanced Driver Assistance System (ADAS) applications, developed using Qt Creator.

Figure 3.32: The GUI in Tesla model Y

The interface in this study draws inspiration from Tesla's sleek GUI, as shown in Figure 3.32, and is organized into three main sections: the dashboard, inference, and notification pages. The notification page offers driver assistance information, illustrated in Figure 3.33, while the dashboard features essential buttons for operating the car's functions, including the three key applications: FCW, TSR, and LDW. Additionally, there are options for playing demo videos and displaying real-time inference results from the Python backend on the inference page, with the remaining graphic elements primarily serving aesthetic and credit purposes.

Figure 3.33: The dashboard page of GUI

Figure 3.34: The inference page of GUI.

ELECTRICAL/ELECTRONIC ARCHITECTURE

The introduction of the Electronic Control Unit (ECU) has revolutionized the automobile industry by enhancing vehicle electrification and mechatronics. ECUs have evolved from merely managing engine operations to overseeing chassis control, electronic components, and in-car entertainment systems. Today, each vehicle is equipped with multiple ECUs that regulate various features, reflecting a significant increase in the number of electronic controllers due to rising demands for fuel efficiency, safety, comfort, and entertainment. For instance, a Level 2 premium car now typically contains over 100 ECUs.

An Electronic Control Unit (ECU) integrates a microcontroller unit (MCU) and an embedded system, where the MCU primarily handles control functions rather than heavy computation. This limitation means that a single ECU typically manages one dedicated task, such as engine control, battery management, or motor control. However, future vehicle development faces challenges due to the growing need for enhanced data processing and computing speed, driven by advancements in intelligent connectivity and autonomous driving technologies. Specifically, developing driver assistance systems will require sophisticated logical processing and the ability to handle unstructured data. Currently, the computational power required by Advanced Driver Assistance System (ADAS) software has reached 10 Tera Operations Per Second (TOPS), while autonomous driving software is expected to approach 100 TOPS, far exceeding the capabilities of existing microcomputers.

The NVIDIA Jetson AGX Orin serves as the high-performance vehicle computer that runs the three ADAS applications and the user interface. It features a 2048-core NVIDIA Ampere architecture GPU, a 12-core Arm Cortex-A78AE 64-bit CPU, and 32 GB of LPDDR5 RAM, and delivers up to 275 TOPS at a maximum power of 50 W, making it both power-efficient and well suited to deep-learning research. A budget-friendly camera captures 1280×720 images and transmits them to the vehicle computer over USB, and the ADAS software's output is shown on a 1920×1080 display through the DisplayPort interface.

Figure 3.35: Connection diagram of the proposed ADAS
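As a rough sketch of the capture path in Figure 3.35, the snippet below reads 1280×720 frames from a USB camera with OpenCV and shows them on the attached display; the device index, window name, and exit key are assumptions rather than the thesis implementation.

import cv2

cap = cv2.VideoCapture(0)                      # USB camera index; adjust if needed
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # here the frame would be handed to the TSR/FCW/LDW models
    cv2.imshow("ADAS input", frame)            # preview on the connected monitor
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()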

This project addresses multithreading challenges in the ADAS system to ensure real-time performance on low-power edge devices while maintaining GUI responsiveness Evaluations of sequential versus concurrent execution indicate that concurrent execution surpasses sequential execution by 156.83% in end-to-end performance, measured from the input frame reception to the final GUI output rendering Although there is currently a lack of software for accurate hardware resource statistics on embedded devices, analysis of the system's execution reveals that concurrent execution utilizes 61% of GPU resources effectively.

As a result, the system satisfies its real-time requirement with an FPS above 60, while the GPU, the most power-consuming component of the vehicle computer, draws only 12.8 W.

Table 3.9: Comparison of performance and hardware utilization between sequential and concurrent programming

Model Sequential (FPS) Concurrent (FPS)
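The concurrent pattern described above can be sketched as follows: one worker thread per task, each fed through a one-slot queue so that every model always processes the newest frame and the GUI thread is never blocked. The function infer_stub and the queue layout are placeholders, not the actual ADAS code.

import queue
import threading

def infer_stub(frame):
    # placeholder for a real TSR/FCW/LDW inference call
    return None

frame_queues = {name: queue.Queue(maxsize=1) for name in ("tsr", "fcw", "ldw")}
result_queue = queue.Queue()

def worker(name, infer_fn):
    # each task consumes its own queue, so a slow model never blocks the others or the GUI
    while True:
        frame = frame_queues[name].get()
        result_queue.put((name, infer_fn(frame)))

def dispatch(frame):
    # called from the capture loop; drop stale frames so every model sees the newest image
    for q in frame_queues.values():
        try:
            q.get_nowait()
        except queue.Empty:
            pass
        q.put_nowait(frame)

for task in frame_queues:
    threading.Thread(target=worker, args=(task, infer_stub), daemon=True).start()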

RESULTS

SIMULATION RESULTS

The proposed Advanced Driver Assistance System (ADAS) utilizes cutting-edge object detection and lane detection models, benchmarked through a customized dataset derived from scientific research This study explores efficient deployment strategies for the selected models, incorporating specialized training techniques and concurrent programming to facilitate the rapid execution of three intensive computer vision and deep learning tasks, all while maintaining a responsive graphical user interface (GUI) Simulation results from the ADAS software, operating on 1280×720 dashcam footage, will be presented in the subsequent sections.

Figure 4.1: Example frames from dashcam footage to simulate the ADAS software

The YOLOv6s FCW and TSR models detect and recognize vehicles and traffic signs accurately across the sampled frames. In Figure 4.2, the 60 km/h speed-limit sign is detected with a confidence of 92%, but it lies outside the designated region (the light-blue box), so it is not shown on the GUI. In Figure 4.3, the same sign falls inside the proposed region and is therefore displayed to the driver.

Figure 4.2: 1st sample result of the traffic sign recognition during inference

Figure 4.3: 2nd sample result of the traffic sign recognition during inference
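The region filter described above can be sketched as a simple check that the center of a detected sign lies inside the designated area (the light-blue box) before it is forwarded to the GUI; the region coordinates, confidence threshold, and detection layout below are assumptions chosen for demonstration.

SIGN_REGION = (640, 0, 1280, 360)      # (x1, y1, x2, y2) in a 1280x720 frame, hypothetical

def sign_in_region(box, region=SIGN_REGION):
    """box = (x1, y1, x2, y2) of a detected sign; True when its center lies inside the region."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    x1, y1, x2, y2 = region
    return x1 <= cx <= x2 and y1 <= cy <= y2

# Example: a 60 km/h sign detected with 92% confidence but outside the region is skipped.
detection = {"label": "speed_limit_60", "conf": 0.92, "box": (200, 80, 260, 140)}
if detection["conf"] > 0.5 and sign_in_region(detection["box"]):
    print("show", detection["label"], "on the GUI")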

The YOLOv6s FCW model detects vehicles reliably. In the first sample frame (Figure 4.4), a vehicle is detected with a confidence of 95% and its coordinates fall inside the caution zone, so a "collision warning" notification is shown on the GUI. In the second sample (Figure 4.5), another vehicle is detected whose coordinates lie in the danger zone, and a "danger ahead" alert is displayed in the inference frame and on the GUI.

Figure 4.4: 1st sample result of the forward collision warning during inference

Figure 4.5: 2nd sample result of the forward collision warning during inference
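The caution/danger-zone logic can be sketched with OpenCV's pointPolygonTest, classifying each detected vehicle by the bottom-center point of its bounding box; the trapezoid coordinates below are placeholders rather than the calibrated zones used in this work.

import cv2
import numpy as np

DANGER_ZONE = np.array([[560, 700], [720, 700], [680, 520], [600, 520]], dtype=np.int32)
CAUTION_ZONE = np.array([[460, 700], [820, 700], [720, 430], [560, 430]], dtype=np.int32)

def zone_status(box):
    """Classify a detected vehicle by the bottom-center of its bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    point = ((x1 + x2) / 2.0, float(y2))       # bottom center approximates the road-contact point
    if cv2.pointPolygonTest(DANGER_ZONE, point, False) >= 0:
        return "danger ahead"
    if cv2.pointPolygonTest(CAUTION_ZONE, point, False) >= 0:
        return "collision warning"
    return "safe"

print(zone_status((600, 500, 700, 690)))       # a box whose bottom edge sits inside the danger zone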

UFLD identifies the lane markers reliably whenever they are visible. In Figure 4.6 no lane departure warning is issued because the vehicle is traveling straight in the middle of the lane, while in Figure 4.7 the upper boundary of the caution zone (yellow line) is used to monitor the lane center, and a "driving off lane" notification is displayed on the GUI as the vehicle veers to the right.

Figure 4.6: 1st sample result of the lane departure warning during inference

Figure 4.7: 2nd sample result of the lane departure warning during inference

EXPERIMENTAL RESULTS

The experiment replaced the dashcam footage with a live camera input to reproduce real traffic conditions more faithfully. As illustrated in Figure 4.8, the proposed ADAS setup comprises (1) a TV displaying Vietnamese traffic scenarios, (2) a camera that sends images over USB to the Jetson Orin vehicle computer, which is powered by an AC-DC adapter, (3) a monitor showing the graphical user interface (GUI), and (4) a keyboard and mouse for user interaction; (3) and (4) may be replaced by a touchscreen in the future for greater convenience.

The results shown in Figure 4.9 indicate that the system achieves high accuracy in detecting nearly all vehicles and correctly identifies and displays lane markings, with the vehicle positioned centrally within the lane However, Figure 4.10 illustrates that the vehicle is veering to the left, prompting a caution alert on the vehicle's graphical user interface (GUI).

Figure 4.9: The safe result with multiple detected objects of the experiment

Figure 4.10: The lane departure warning result of the experiment

Figures 4.11 and 4.12 demonstrate that both FCW and TSR applications function effectively when utilizing dashcam footage as input This suggests that implementing the ADAS system in a controlled environment within a real vehicle is likely to yield optimal performance.

Figure 4.11: The traffic sign recognition result of the experiment

Figure 4.12: The forward collision warning result of the experiment.

CONCLUSION AND FUTURE WORK

CONCLUSION

This study presents three key Advanced Driver Assistance System (ADAS) applications built on state-of-the-art object detection with YOLOv6 and lane detection with UFLD, integrated within an intuitive graphical user interface. By fine-tuning YOLOv6 and converting the models to FP16 for fast inference, the system achieves both high accuracy and high throughput. A benchmark of five YOLO models on a custom dataset of 18,832 labeled traffic-scenario objects yielded mean Average Precision (mAP) scores of 88.6% and 82.1% on the TSR and FCW datasets, respectively. The ADAS software runs in real time at 71 frames per second while using only 61% of the GPU. Simulation and experimental results confirm that the system delivers the intended driver-assistance features, such as intuitive warnings and instructions through the GUI. This initial success paves the way for future advances in vehicle electrical/electronic architecture and, ultimately, toward autonomous vehicles.

FUTURE WORK

Future enhancements will focus on improving practicality and aligning with OEM expectations The FCW application will be upgraded to calculate and adjust collision warning distances while enabling automatic braking through dedicated sensors and ECUs Additionally, TSR and FCW applications will improve object detection capabilities by incorporating a wider range of signs, vehicles, and pedestrians, supported by diverse traffic scenario data Furthermore, UFLD will be advanced by replacing the ResNet18 backbone with a superior model like RepVGG-A0 Lastly, models will be converted to INT8 precision using TensorRT to achieve faster inference speeds.
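As a hypothetical sketch of the planned INT8 step, the snippet below invokes TensorRT's trtexec tool on an exported ONNX model from Python; the file names are placeholders, and a proper calibration dataset or cache would be required to preserve accuracy.

import subprocess

# Build an INT8 TensorRT engine from a hypothetical exported ONNX model.
# Without calibration data, trtexec falls back to dummy dynamic ranges, so this is
# only a starting point for the future work described above.
subprocess.run([
    "trtexec",
    "--onnx=yolov6s_fcw.onnx",               # placeholder model file
    "--int8",
    "--saveEngine=yolov6s_fcw_int8.engine",  # placeholder output engine
], check=True)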

REFERENCES

[1] S. Singh, “Critical reasons for crashes investigated in the National Motor Vehicle Crash Causation Survey”, US National Highway Traffic Safety Administration, DOT HS 812 506, Washington, DC, USA, pp. 2-2, March 2018.

[2] L Yue, M Abdel-Aty, Y Wu, and L Wang, “Assessment of the safety benefits of vehicles’ advanced driver assistance, connectivity and low-level automation systems”, Accident Anal Prevention, vol 117, pp 55-64, Aug 2018

[3] M. Hasenjäger, M. Heckmann and H. Wersing, “A Survey of Personalization for Advanced Driver Assistance Systems,” IEEE Transactions on Intelligent Vehicles, vol. 5, no. 2, pp. 335-344, June 2020, doi: 10.1109/TIV.2019.2955910.

[4] Dumoulin, Vincent, and Visin, Francesco “A guide to convolution arithmetic for deep learning.” arXiv, 2016, https://doi.org/10.48550/arXiv.1603.07285

[5] Redmon, Joseph, and Farhadi, Ali. “YOLOv3: An Incremental Improvement.” arXiv, 2018, https://doi.org/10.48550/arXiv.1804.02767

[6] Bochkovskiy, Alexey, et al "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv, 2020, https://doi.org/10.48550/arXiv.2004.10934

[7] Jocher Glenn. “YOLOv5 release v6.1” (2022). [Online]. Available: https://github.com/ultralytics/yolov5/releases/tag/v6.1

[8] Li, Chuyi, et al “YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications.” arXiv, 2022, https://doi.org/10.48550/arXiv.2209.02976

[9] Wang, Chien, et al “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors.” arXiv, 2022, https://doi.org/10.48550/arXiv.2207.02696

[10] Ge, Zheng, et al “YOLOX: Exceeding YOLO Series in 2021.” arXiv, 2021, https://doi.org/10.48550/arXiv.2107.08430

[11] Wu, Xiongwei, et al “Recent Advances in Deep Learning for Object Detection.” arXiv, 2019, https://doi.org/10.48550/arXiv.1908.03673

[12] Carranza-García, M.; Torres-Mateo, J.; Lara-Benítez, P.; García-Gutiérrez, J. On the Performance of One-Stage and Two-Stage Object Detectors in Autonomous Vehicles Using Camera Data. Remote Sens. 2021, 13, 89. https://doi.org/10.3390/rs13010089

[13] Ding, Xiaohan, et al. “RepVGG: Making VGG-style ConvNets Great Again.” arXiv, 2021, https://doi.org/10.48550/arXiv.2101.03697

[14] Lin, Tsung, et al “Feature Pyramid Networks for Object Detection.” arXiv, 2016, https://doi.org/10.48550/arXiv.1612.03144

[15] Zhang, Can, et al “PAN: Towards Fast Action Recognition via Learning Persistence of Appearance.” arXiv, 2020, https://doi.org/10.48550/arXiv.2008.03462

[16] Deloitte, “Autonomous Driving” (2019). [Online]. Available: Deloitte_Autonomous-Driving.pdf

[17] National Highway Traffic Safety Administration, “Driver Assistance Technologies” (2022). [Online]. Available: https://www.nhtsa.gov/equipment/driver-assistance-technologies

[18] Z Zhu, D Liang, S Zhang, X Huang, B Li, and S Hu, “Traffic-Sign Detection and Classification in the Wild,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp 2110-2118, doi: 10.1109/CVPR.2016.232

[19] NVIDIA, “TensorRT” (2022). [Online]. Available: https://developer.nvidia.com/tensorrt

[20] Takaki, M.; Fujiyoshi, H. Traffic Sign Recognition Using SIFT Features. IEEJ Trans. Electron. Inf. Syst. 2009, 129, 824–831.

[21] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005, vol. 1, pp. 886-893, doi: 10.1109/CVPR.2005.177.

[22] Department of Transportation, “Circular No. 91/2015/TT-BGTVT on speed and following distance for motor vehicles and special-purpose vehicles in road traffic” (2015). [Online].

[23] Wei, Pan, et al “LiDAR and Camera Detection Fusion in a Real-Time Industrial Multi-Sensor Collision Avoidance System”, arXiv, 2018 https://doi.org/10.48550/arXiv.1807.10573

[24] Ziebinski, A.; Cupek, R.; Erdogan, H.; Waechter, S. “A Survey of ADAS Technologies for the Future Perspective of Sensor Fusion,” in Computational Collective Intelligence, N. T. Nguyen et al., Eds., Springer International Publishing, 2016.

[25] Nur, S. A., et al. “Vehicle detection based on underneath vehicle shadow using edge features,” 2016 IEEE International Conference on Control System, Computing and Engineering (ICCSCE), Penang, Malaysia, 2016.

[26] Lin, Tsung, et al “Microsoft COCO: Common Objects in Context.” arXiv, 2014, https://doi.org/10.48550/arXiv.1405.0312

[27] Elsagheer Mohamed, S.A.; Alshalfan, K.A.; Al-Hagery, M.A.; Ben Othman, M.T. Safe Driving Distance and Speed for Collision Avoidance in Connected Vehicles. Sensors 2022, 22, 7051. https://doi.org/10.3390/s22187051

[28] Aly, Mohamed “Real-time Detection of Lane Markers in Urban Streets.” arXiv, 2014, https://doi.org/10.1109/IVS.2008.4621152

[29] Neven, D., De Brabandere, B., Georgoulis, S., Proesmans, M., Van Gool, L.: Towards end-to-end lane detection: an instance segmentation approach In: Proceedings of the IEEE Intelligent Vehicles Symposium pp 286–291 (2018)

[30] Qin, Zequn, et al “Ultra Fast Structure-aware Deep Lane Detection.” arXiv, 2020, https://doi.org/10.48550/arXiv.2004.11757

[31] NVIDIA “Jetson AGX Orin developer kit specification” (2022) [Online] Available: Jetson AGX Orin for Advanced Robotics | NVIDIA

Figure 1: Plagiarism check result by Turnitin.
