MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
GRADUATION PROJECT
AUTOMATION AND CONTROL ENGINEERING TECHNOLOGY
INTRODUCTION
Proposal
Recent advancements in algorithms, sensors, and microprocessors have transformed autonomous driving from a theoretical concept into a practical reality. Deep learning plays a crucial role, acting as the brain of autonomous systems that enables vehicles to make human-like decisions and understand their environment. However, challenges such as lane-keeping in unpredictable conditions and obstacle avoidance remain significant. Ongoing research is focused on developing adaptive behaviors and optimizing processing times for real-time systems. To ensure the safety of passengers, pedestrians, and roadside objects, modern sensors capable of real-time data processing are essential. Autonomous mobile robots equipped with Light Detection and Ranging (LiDAR) technology are being explored as effective solutions for obstacle avoidance.
In modern vehicle guidance systems, techniques utilizing cameras and LiDAR have emerged, with significant advancements in lane detection. One study focuses on image processing to identify road lanes, emphasizing three key operations: detecting vanishing points, measuring road widths, and determining the Region of Interest (ROI). This method employs image enhancement and edge extraction through the Sobel filter, with the Hough Transform as the main tool for lane recognition. Another study introduces a lane identification approach that combines a semantic segmentation network with an optical flow estimation network, achieving quick and reliable lane segmentation, discrimination, and mapping. Additionally, enhancing sensor fusion systems by incorporating extra sensors has led to improved performance and resilience, particularly in utilizing camera data for localization and mapping, which traditionally relied on radar and LiDAR. Sensor fusion is rapidly becoming a vital component in the autonomous vehicle industry, significantly contributing to the development of autonomous systems.
This thesis presents the development of a low-cost autonomous ground vehicle system that utilizes a monocular camera and a 2D LiDAR for lane tracking and obstacle avoidance. By employing deep learning techniques, the project enables real-time lane and object detection. Additionally, mathematical equations are applied to cluster the 2D LiDAR data and fit straight lines to it in order to accurately identify obstacles.
Research Objective
This project's study and execution will focus on the following goals:
Research lane detection using "Ultra Fast Structure-aware Deep Lane Detection" to define the steering angle for the robot
Research YOLO-v4 algorithm for detection and recognition of the traffic signs (left, right, straight, and stop) and obstacle objects ("car" with scale suitable for the system)
Apply clustering and feature extraction algorithms to the 2D LiDAR data to support the obstacle avoidance system
Use a PID controller for the steering angle
Construct a robot that is able to run on the HCMUTE campus
Learn how to control servo and motor using Arduino, Devo7, and NVIDIA Jetson TX2.
Limitation
Because the whole project is long-term, many shortcomings remain. In this thesis, some apparent limitations are presented as follows:
The automobile cannot run in complex scenarios such as dynamic obstacles, abnormal weather, pedestrians on the road, etc.
In our proposal, the self-driving robot can operate in specific locations of the HCMUTE campus
The limitations of the 1:10-scale automobile and the camera placement prevent it from running correctly on large roads
Low-cost sensors do not perform well in outdoor conditions
Due to hardware limitations, this project favors light yet efficient algorithms. As a result, some of the most exact methodologies are not used in this thesis.
Thesis summary
The structure of this thesis is arranged as follows:
This chapter introduced the topic, the objectives, the limitations, the related works of the research, and the layout of this thesis.
LITERATURE REVIEW
Self-driving Car Technologies
The camera plays a crucial role in the visual perception system of self-driving vehicles, as shown in figure 2.1. It enhances the vehicle's awareness of its surroundings by working in conjunction with tracking and detection algorithms from the central controller. In partially autonomous vehicles, cameras are integral to Advanced Driver Assist Systems (ADAS), including features like lane departure warnings and parking assistance. For fully autonomous systems, researchers utilize a combination of cameras, LiDAR, lasers, and IMU-GPS systems within a framework known as the 'Sensor Fusion' algorithm.
Figure 2.1 The camera system on MIT self-driving car [4]
Self-driving cars utilize LiDAR technology to assess their surroundings, functioning like a "pair of eyes" for the vehicle. This technology is essential for the advancement of autonomous vehicles, and automakers are actively incorporating it into their development processes, as shown in figure 2.2.
LiDAR technology enables vehicles to perceive their surroundings by measuring distances to objects, creating a detailed 3D model of the environment. An integrated computer system calculates the distance by monitoring the time between emitted laser pulses and their return to the receiver, processing up to 100,000 laser pulses per second. This data allows LiDAR to construct accurate 3D representations of nearby objects, such as vehicles and pedestrians, while predicting their behavior to inform the vehicle's responses. There are two main types of LiDAR, 3D LiDAR and 2D LiDAR, each designed for specific applications.
Figure 2.2 LIDAR emits lasers to the environment, receives bounced pulses and calculates distance, makes a point-clouds map, and reconstructs the surrounding environment
Figure 2.3 a) 3D LIDAR sensor; b) Low-cost 2D Lidar
LiDAR technology utilizes an infrared laser beam to accurately measure distances to nearby objects, distinguishing it from radar, which relies on radio waves. Various types of LiDAR operate within the 900-nanometer wavelength range, with longer wavelengths being more effective in adverse weather conditions like rain and fog. Modern LiDAR systems feature a spinning swivel that scans laser beams, creating a point cloud that captures the shape and size of objects based on the reflections of the laser pulses. Compared to radar, LiDAR offers a more focused laser beam, more vertical scan layers, and a higher density of points per layer. However, current-generation LiDAR systems cannot directly measure object velocity, necessitating the use of multiple scans from different positions. Additionally, LiDAR sensors require regular cleaning to maintain performance, as weather and dirt can hinder their effectiveness.
Deep Learning in Autonomous Driving
As shown in figure 2.4, there are three types of object recognition algorithms, each with a distinct form of prediction. They are listed as follows:
Image classification: classifies an image and gives the probability of each class. A traditional CNN is used in this method
Classification and localization: finds an object in a photo, predicting whether an object is present and where it is located
Detection: multiple objects in the picture are detected by algorithms such as YOLO or R-CNN
Recent advancements in detection techniques have emerged, primarily focusing on object identification and shape recognition within images. The two key output types of these models are bounding box detection and landmark detection, as illustrated in figure 2.5.
The Intersection over Union (IoU) technique measures the accuracy of object detection within a dataset by assessing the alignment of a predicted bounding box with the ground-truth bounding box. IoU values range from 0 to 1, indicating the degree of overlap between the two boxes. For a detailed understanding of the IoU equation, refer to figure 2.6.
Figure 2.5 In the left image, the parameters of bounding box detection, including the box center, height, and width; in the right image, the series of landmark points
Figure 2.4 Three main types of object recognition algorithms
Intersection over Union is an effective metric for assessing customized object detectors
The Intersection over Union (IoU) quantifies the overlap between a predicted bounding box and the ground-truth bounding box relative to their union area. An example illustrating the calculation of IoU for various bounding boxes is presented in Figure 2.7.
Figure 2.7 Example of validating the performance of output model
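To make the IoU computation concrete, the following minimal sketch computes IoU for two axis-aligned boxes given as (x1, y1, x2, y2); the box format and the example values are illustrative assumptions, not taken from the thesis.

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted box compared against a ground-truth box
print(iou((20, 20, 80, 80), (30, 30, 90, 90)))  # ~0.53
```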
The Non-max suppression technique effectively eliminates duplicate overlapping bounding boxes of the same object by selecting the most representative ones. As shown in Figure 2.8, bounding boxes with a probability lower than 0.6 are discarded first, and the remaining boxes are processed in two steps (a small sketch follows these steps):
Choose the predicted bounding box with the highest probability
Remove any remaining box whose IoU with the chosen box is 0.5 or greater
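A minimal sketch of the non-max suppression steps described above, reusing the iou() helper from the previous sketch; the 0.6 probability cut-off and 0.5 IoU threshold follow the values quoted in the text.

```python
def non_max_suppression(boxes, scores, prob_thresh=0.6, iou_thresh=0.5):
    """boxes: list of (x1, y1, x2, y2); scores: list of confidence values."""
    # 1. Discard boxes with probability below the threshold
    candidates = [(s, b) for s, b in zip(scores, boxes) if s >= prob_thresh]
    candidates.sort(key=lambda sb: sb[0], reverse=True)
    kept = []
    while candidates:
        # 2. Keep the box with the highest probability
        best_score, best_box = candidates.pop(0)
        kept.append((best_score, best_box))
        # 3. Drop remaining boxes that overlap it with IoU >= threshold
        candidates = [(s, b) for s, b in candidates if iou(b, best_box) < iou_thresh]
    return kept
```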
2.2.1.1 You Only Look Once – YOLO
There are several object detection methods, which may be divided into two categories:
Classification-based algorithms operate in two stages: initially, they identify interesting regions within an image, followed by the application of convolutional neural networks to classify those regions. However, this method can be time-consuming.
Figure 2.6 The equation of the Intersection over Union
Figure 2.8 visualize step of the Non-max suppression methods
Prominent examples of this technique, which performs prediction on selected regions, are the Region-based Convolutional Neural Network (R-CNN) and its variants Fast R-CNN and Faster R-CNN.
One-stage regression-based algorithmic methods analyze an entire image to predict class labels and bounding boxes simultaneously, rather than focusing on smaller sections. A prominent example of this approach is YOLO (You Only Look Once), a widely recognized real-time object detection technique.
The operations of single-stage and two-stage object detectors are illustrated in figure 2.9. In a single-stage approach, the detection head is applied directly to the feature map, while a two-stage approach utilizes a region-proposal network before detection. For applications like autonomous driving, achieving high accuracy and real-time inference speed is essential, making the selection of an efficient object detector crucial. YOLO (You Only Look Once) is a prominent single-stage object detector that balances speed and accuracy. Since its inception, object detection has evolved significantly, with state-of-the-art designs demonstrating strong generalization across various test datasets, and understanding the origins of the YOLO family and its advancements is key to appreciating the structures now regarded as among the best in the field. This section therefore focuses on the single-stage YOLO object detection family, reviewing its versions from YOLOv1 to YOLOv4.
YOLO (You Only Look Once) represents a groundbreaking advancement in object detection by treating the detection process as a regression problem. This single-stage object detector architecture analyzes an image in a single pass to predict the positions and class labels of objects. YOLOv1 utilizes a unified neural network to jointly forecast class labels and bounding box coordinates.
Figure 2.9 Detector workflow (a) One-Stage Detector, (b) Two-Stage Detector [30]
The single-stage detection technique processes class probabilities and bounding box coordinates from an entire image in one pass, in contrast with two-stage detector methods like Fast R-CNN and Faster R-CNN. This streamlined detection pipeline operates as a unified network, allowing for end-to-end optimization.
The YOLO architecture is designed for high-speed performance, achieving 45 frames per second (FPS) on a Titan X GPU. This efficiency stems from its end-to-end training approach, akin to image classification. Additionally, the authors introduced Fast YOLO, a more lightweight variant that processes images at 155 FPS while utilizing fewer layers than the original model.
Redmon and Farhadi (2017) introduced "YOLO9000: Better, Faster, Stronger," presenting two advanced variations of the YOLO algorithm: YOLOv2 and YOLO9000. While both variants share similarities, they differ in their training strategies, utilizing detection datasets like Pascal VOC.
YOLOv2's algorithm was trained using the MS COCO dataset, while YOLO9000 was developed to predict over 9000 unique object categories by simultaneously training on both the MS COCO and ImageNet datasets.
The enhanced YOLOv2 model surpassed leading techniques like Faster R-CNN and SSD in both speed and accuracy by utilizing several advanced methodologies. A key innovation, multi-scale training, enabled the network to predict using different input sizes, effectively balancing accuracy and speed.
YOLOv2 achieved a mean Average Precision (mAP) of 76.8 on the VOC 2007 dataset with an input resolution of 416x416 pixels on the Titan X GPU. Additionally, it reached a mAP of 78.6 while maintaining a frame rate of 40 frames per second on the same dataset.
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks (CNNs) consist of three primary types of layers, as depicted in figure 2.15. These layers serve as the essential components that enable the network to process and analyze data effectively.
Input layer: the layer through which we feed the model with data. It contains as many neurons as there are features in our data
The hidden layers receive input from the input layer, and the number of hidden layers can vary based on the model and data. Typically, each hidden layer contains more neurons than the total number of features. The output of each hidden layer is produced by multiplying the output of the previous layer by the learnable weights of the current layer, adding the learnable biases, and then applying a nonlinear activation function to the resulting matrix
Output layer: Each class is then input into a logistic function such as sigmoid or softmax, which transforms the output of each class into a probability score
Convolution layers serve as the foundational component for feature extraction from input images. They utilize a series of filters with defined sizes, such as 3x3 or 5x5, to perform convolution operations. The result of this process is referred to as a feature map or activation map, as illustrated in Figure 2.16, which demonstrates the convolution operation on a 6x6 image.
Figure 2.15 The architecture of Convolutional Neural Networks [31]
Figure 2.16 The CONV for image (6x6) with 3x3 kernel and stride =1 The result is a 4x4 feature map [13]
The pooling layer (POOL) is a downsampling technique that involves sliding a smaller filter across each channel of the feature map. Its primary purpose is to reduce the number of parameters and the computational cost when dealing with large input data, while still preserving essential information, which makes the model more robust to variations in feature positions within the input image. Two functions are commonly used in the pooling operation.
Max pooling selects the highest value within the filter's window, resulting in a feature map that highlights the most significant features from the previous layer. This technique keeps the brightest pixels in an image, making it particularly useful when the background of the image is dark. The operation of the max pooling layer on a 4x4 matrix is illustrated in Figure 2.17.
Average pooling calculates the average of each patch of the feature map, resulting in a smoothed output. The operation of average pooling on a 4x4 matrix is shown in figure 2.18.
Figure 2.17 Max pooling for 4x4 matrix with 2x2 filter and 2x2 stride
Figure 2.18 Average pooling for 4x4 matrix with 2x2 filter and 2x2 stride
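A minimal NumPy sketch of the two pooling operations on a 4x4 matrix with a 2x2 filter and stride 2, mirroring figures 2.17 and 2.18; the input values are arbitrary.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D array."""
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 8, 3],
              [4, 9, 5, 6]], dtype=float)
print(pool2d(x, mode="max"))   # 2x2 map of patch maxima
print(pool2d(x, mode="avg"))   # 2x2 map of patch averages
```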
A fully connected layer, as illustrated in figure 2.19, flattens the output from the previous layers into a vector and connects it to every neuron, making it a cost-effective way of learning non-linear combinations of the high-level features produced by the convolutional layers. This layer learns a potentially nonlinear function essential for image identification, which requires transforming the feature map into a format compatible with a Multi-Layer Perceptron. Consequently, the image is converted into a column vector, and the output of the flatten operation is fed into a feed-forward neural network, where backpropagation is applied during each training iteration.
Zero padding involves adding zeros to the edges of an input to minimize information loss at the corners, as shown in figure 2.20. By incorporating additional rows and columns around the image, padding ensures that the input and output sizes remain consistent. This technique improves the accuracy of image analysis performed by Convolutional Neural Networks (CNNs) and comes in two primary types:
Valid (No padding): our input size will not be preserved
Same (zeros surrounding the corner): the dimension of the output will remain unchanged
Figure 2.19 Visualization of fully connected layer [13]
Figure 2.20 Apply zero Padding for input matrix
Stride refers to the number of pixels by which the window moves after each operation; a small stride leads to significant overlap of the filter, which allows more features to be shared across the outputs. Figure 2.21 illustrates the overlapping region when applying a 3x3 filter with a stride of one.
Increasing the stride results in fewer shared parameters among filters and reduces the feature map size, as shown in figure 2.22. A larger stride effectively downsamples the image, which can obscure lower-level features.
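As a quick illustration of how kernel size, padding, and stride determine the feature-map size, the following helper applies the standard output-size formula; it is a generic sketch, not code from the thesis.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output: floor((W - F + 2P) / S) + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(6, 3, stride=1))  # 4, matching the 6x6 input of figure 2.16
print(conv_output_size(5, 3, stride=1))  # 3, the overlapping case of figure 2.21
print(conv_output_size(5, 3, stride=2))  # 2, the downsampled case of figure 2.22
```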
Back-propagation in neural networks is essential for adjusting the weights and biases of neurons in response to output errors. This process relies on gradients computed from the error to perform the updates. Unlike linear regression models, neural networks employ activation functions, which is what allows them to learn and perform complex tasks effectively.
The Rectified Linear Unit (ReLU) is the most widely used activation function in deep learning, particularly in hidden layers. Its simple mathematical form makes it less computationally expensive than functions like tanh and sigmoid. ReLU promotes sparsity in the network by activating only a few neurons at a time, enhancing processing efficiency. However, it does not handle negative input values effectively.
Figure 2.21 Convolution for matrix 5x5 with stride = 1
Figure 2.22 Convolution for matrix 5x5 with stride = 2
Negative inputs become zero under this activation function, which decreases the model's capacity to fit and train precisely, as illustrated in figure 2.23.
Leaky ReLU is an improved version of the ReLU activation function, featuring a slight negative slope for negative input values instead of a flat slope. This modification addresses a key limitation of ReLU, which maps negative values to zero and thus prevents the model from learning from negative inputs. By allowing a small negative slope, Leaky ReLU maps negative values more effectively and improves the model's training process. The shape of the Leaky ReLU function is illustrated in Figure 2.24.
Figure 2.24 The graph of Leaky ReLU [13]
\[ f(x) = \max(0.01x,\ x) \tag{2.12} \]
Figure 2.23 The graph of ReLU [13]
The Softmax function converts a vector of numbers into a vector of probabilities, where each probability is proportional to the exponential of the corresponding value in the vector, as shown in equation (2.13).
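A minimal, numerically stable softmax sketch corresponding to equation (2.13); the input vector is arbitrary.

```python
import numpy as np

def softmax(z):
    """Convert a vector of scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approximately [0.66, 0.24, 0.10]
```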
2.3.5 Various techniques for evaluating a deep learning model
The development of a deep learning model depends critically on model assessment. To improve accuracy and fit the model, evaluation metrics must be examined. There are four possible prediction states:
True positives (TP): The model predicts positive, ground-truth is positive
False positives (FP): The model predicts Positive, ground-truth is negative
True negatives (TN): The model predicts negative, ground-truth is negative
False negatives (FN): The model predicts negative, ground-truth is positive
The confusion matrix is essential for grasping various classification metrics such as precision and recall, as it clearly shows the correct and incorrect predictions for each class, as demonstrated in figure 2.25.
Accuracy is the proportion of correct predictions on the test data, as seen in equation (2.14). With imbalanced classes, accuracy can be a misleading metric.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2.14} \]

Precision is the ratio of positive samples correctly identified to the total number of samples classified as positive, as seen in equation (2.15). It indicates how reliable the model is when it labels an instance as positive.

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2.15} \]

Figure 2.25 The confusion matrix in detail
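A small sketch computing accuracy and precision (equations 2.14 and 2.15) from the four confusion-matrix counts; the example counts are made up.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Proportion of predicted positives that are actually positive."""
    return tp / (tp + fp)

# Hypothetical confusion-matrix counts
tp, tn, fp, fn = 80, 90, 10, 20
print(accuracy(tp, tn, fp, fn))  # 0.85
print(precision(tp, fp))         # ~0.89
```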
PID controller
The PID controller aims to generate a suitable command to maintain system stability, utilizing three key parameters: KP (proportional), KI (integral), and KD (derivative). It processes an error signal, defined as the difference between the setpoint and the actual output, and the three PID terms are tuned so that the system reaches the desired output. The overall workflow of the PID controller is illustrated in Figure 2.26.
The fundamental terms for understanding the PID controller are as follows:
Proportional (P) gain scales the current error of the system; a high proportional gain produces a large change in output for a given change in error. However, if the proportional gain is too high, it can make the system unstable. Conversely, a very small proportional gain may not adequately drive the system's response to large input errors.
The integral term (I) accumulates the error over time (t); it improves the system's convergence to the setpoint and reduces the steady-state error typical of a pure proportional controller. Because it responds to the accumulated error, the integral term can drive the current value past the setpoint. Its primary objective is to eliminate steady-state error; however, when the error reaches zero, the integral may still be positive, which can lead to overshooting, after which the integral term decreases again.
The derivative term (D) represents the rate of change of the error, calculated by determining the slope of the error over time and multiplying it by the derivative gain Kd. This gain determines the influence of the derivative term on the overall control action, improving both the settling time and the stability of the system.
Error is the difference between the system's desired setpoint and output feedback
In continuous time, the overall control function is given in equation (2.18):

\[ u(t) = K_p\, e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d \frac{de(t)}{dt} \tag{2.18} \]

In discrete time, the control law is given in equation (2.19):

\[ u[k] = K_p\, e[k] + K_i \sum_{j=0}^{k} e[j]\, \Delta t + K_d \frac{e[k] - e[k-1]}{\Delta t} \tag{2.19} \]

Here, the proportional gain \( K_p \), integral gain \( K_i \), and derivative gain \( K_d \) are the parameters that shape the system response, \( e \) is the error between the target input and the feedback value of the system, \( t \) denotes the instantaneous time, and the variable of integration \( \tau \) spans values from time 0 to the current time \( t \).
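A minimal discrete PID sketch following equation (2.19); the gains, sample time, and setpoint used in the example are placeholders, not the values tuned for the robot.

```python
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt                  # accumulated error (I term)
        derivative = (error - self.prev_error) / self.dt  # rate of change of error (D term)
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: drive a steering angle toward 0 degrees at 50 Hz
pid = PID(kp=0.8, ki=0.05, kd=0.1, dt=0.02)
command = pid.update(setpoint=0.0, measurement=5.0)
```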
Random sample consensus algorithm (RANSAC)
Traditionally, the least squares method has been employed to fit a straight line, serving as a fundamental statistical approach for determining the best-fitting regression line. The technique is characterized by an equation with fixed parameters and is widely used in evaluation and statistical modeling. In regression analysis, least squares is regarded as a standard strategy for approximating sets of equations, particularly when there are more equations than unknowns.
The least squares method minimizes the sum of the squared deviations, effectively reducing the error in each equation's residual.
Figure 2.26 The map of PID operation
Minimizing the sum of squares of the errors also helps in determining the variance of the observed data. Figure 2.27 depicts the raw data before applying a fitting algorithm.
Minimizing the sum of squares is a common technique in data fitting, where the sum of squared errors represents the differences between the observed values and the corresponding fitted values from the model. The goal is to achieve the best-fit outcome by reducing these squared errors.
The least-squares distance technique is one of the most commonly used approaches. It consists of three steps (a small sketch follows the list):
- Create a cost function that totals the distance between each point and the line
- Adjust the slope of the line's equation and analyze the cost function iteratively
- The line with the lowest cost function should be chosen
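A minimal least-squares line fit over the three steps above, written directly with NumPy; the sample points are arbitrary.

```python
import numpy as np

# Arbitrary sample points roughly following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Closed-form least squares: minimize the sum of squared vertical distances
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
cost = np.sum(residuals ** 2)  # the cost function being minimized
print(slope, intercept, cost)
```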
While the least squares technique is often regarded as an effective approach for locating the best-fit line, it has a few drawbacks. They are as follows:
- This methodology only illustrates the relationship between the two variables; any other potential causes and effects are disregarded
- When the data are not uniformly distributed, this strategy is not dependable. The approach is very sensitive to extreme values, which in practice may bias the least-squares analysis findings
Figure 2.27 Sample data with no fitting algorithm
Figure 2.28 The sample result of the least squares technique in the absence of noise
The least squares technique is effective for noise-free data but struggles with noisy datasets, as shown in Figures 2.28 and 2.29. To address this issue, the RANSAC (Random Sample Consensus) algorithm offers an iterative method for estimating the parameters of a mathematical model from observed data that contains outliers. RANSAC produces estimates that are not unduly influenced by outliers, effectively serving as an outlier detection tool. It is a non-deterministic algorithm that yields a reasonable result only with a certain probability, which improves as more iterations are allowed. Introduced by Fischler and Bolles of SRI International in 1981, RANSAC was initially applied to solve the Location Determination Problem (LDP), which involves identifying spatial locations that correspond to known landmark coordinates in an image.
The RANSAC method estimates model parameters by randomly sampling the data, effectively handling datasets that contain both inliers and outliers. It employs a voting mechanism to identify the best-fitting model, as shown in figure 2.30. In this process, data elements cast votes for one or more model candidates, relying on two key assumptions: noisy features will not consistently vote for any single model, and there are enough features to agree on a good model despite some missing data. The RANSAC algorithm operates in two iterative phases.
In the initial stage, a randomly selected minimal subset of the input dataset is used to fit a model and estimate its parameters; the size of this sample subset is the smallest sufficient to determine the model parameters.
The method then assesses the compatibility of the entire dataset with the model defined by the estimated parameters. A data element is classified as an outlier if it deviates from the fitted model by more than a specified error threshold, which accounts for the maximum deviation attributable to noise.
Figure 2.29 Sample outcome of the least squares method with a few noisy data points
The consensus set represents the inliers of the model. RANSAC iteratively performs these two processes until the consensus set contains a sufficient number of inliers. It requires observed data values, a model for data fitting, and confidence parameters as inputs. The RANSAC algorithm follows a series of stages to achieve its results:
1. Choose a random subset of the original data. This subset is known as the hypothetical inliers
2. A model is fitted to the hypothetical inliers
3. The remaining data are then compared to the model. Points that match the estimated model well, according to a model-specific loss function, are added to the consensus set
4. The estimated model is likely to be a good one if the consensus set contains a sufficient number of points
5. The model may then be improved by re-estimating it using all members of the consensus set
The RANSAC method may be exemplified by the algorithms that are described as follows
A model serves to explain observed data points, with a minimum number of data points, denoted as \( n \), necessary for estimating its parameters The algorithm is constrained by a maximum number of iterations, represented by \( k \), while a threshold value \( t \) is used to identify data points that align well with the model Additionally, \( d \) indicates the number of close data points needed to confidently assert that the model fits well.
Figure 2.30 The graphic on the left shows the components that make up a raw data set. The illustration on the right shows a line fitted by RANSAC
Outputs: BestFit – model parameters that best fit the data (or null if no good model is found)
The algorithm iteratively selects random samples from the data to identify potential inliers and fits a model to these samples. For each data point not included in the initial selection, it checks whether the point agrees with the model within the specified error threshold. If the number of inliers exceeds the defined limit, a new model is fitted to both the initial inliers and the newly identified inliers. The algorithm evaluates the fit of this new model, updating the best model if it demonstrates a lower error than previously recorded. This process continues until the predetermined number of iterations is reached, ultimately returning the best-fitting model.
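A minimal sketch of the RANSAC loop described above for fitting a 2D line; the iteration count, error threshold, and inlier count are illustrative choices of the parameters k, t, and d, and the input is assumed to be an (N, 2) NumPy array.

```python
import numpy as np

def ransac_line(points, k=200, t=0.1, d=20, rng=np.random.default_rng(0)):
    """Fit y = a*x + b to points (an (N, 2) array) while tolerating outliers."""
    best_fit, best_err = None, np.inf
    for _ in range(k):
        # 1. Pick a minimal random subset (2 points define a line) - hypothetical inliers
        idx = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[idx]
        if x1 == x2:
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # 2. Collect points that agree with this model within threshold t (consensus set)
        err = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = points[err < t]
        # 3. If enough inliers, refit on the consensus set and keep the best model
        if len(inliers) > d:
            a, b = np.polyfit(inliers[:, 0], inliers[:, 1], 1)
            fit_err = np.mean((inliers[:, 1] - (a * inliers[:, 0] + b)) ** 2)
            if fit_err < best_err:
                best_fit, best_err = (a, b), fit_err
    return best_fit  # None if no good model was found
```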
TENSOR-RT PATTERN
NVIDIA developed TensorRT, a library aimed at accelerating inference on its GPUs, built on the parallel programming architecture CUDA. This technology significantly enhances the speed of inference for various real-time services and embedded applications, with the documentation indicating that it can deliver up to 40 times faster inference compared to CPU-only systems.
For deep learning models, TensorRT performs five kinds of optimization. Following [5], this section goes through all five optimization methods, as illustrated in figure 2.31.
Throughout the training process, parameters and activations are stored with FP32 (Floating Point 32) precision; after training, they can be converted to FP16 or INT8 precision. This optimization decreases latency and significantly lowers the model size, as illustrated in figure 2.32.
FP16 has a shorter dynamic range than FP32, which can lead to weight clipping due to overflow; however, this does not noticeably affect accuracy in practice. The model effectively retains essential features during training while discarding unnecessary elements, so although reducing precision does discard some information, it is assumed that the model mainly eliminates noise. This simple trimming of overflowing weights does not work when transitioning to INT8 precision, because INT8 values only range from -127 to +127, so most weights would overflow and the model precision would degrade significantly. To address this, scaling and bias terms are used to map the weights into INT8 precision.
We focus on the scaling factor, as the bias component does not contribute much value. To determine the scaling factor, the FP32 values -|max| and |max| are mapped to the INT8 values -127 and 127, respectively, while the other values are assigned according to a linear scale.
Figure 2.31 TensorRT uses a variety of optimization techniques to help models run faster
Figure 2.32 The dynamic range of different precisions
To avoid a significant decrease in accuracy, a saturation threshold can be chosen instead of the maximum value when mapping to the range [-127, 127]; any outliers beyond the threshold are clipped to these extreme values, as shown in figure 2.33.
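A minimal sketch of symmetric INT8 quantization with an optional saturation threshold, following the description above; the example weights and threshold are arbitrary, and real TensorRT calibration chooses the threshold automatically.

```python
import numpy as np

def quantize_int8(weights, threshold=None):
    """Map FP32 values to INT8 with a symmetric linear scale."""
    t = threshold if threshold is not None else np.abs(weights).max()
    scale = 127.0 / t
    q = np.clip(np.round(weights * scale), -127, 127)  # outliers saturate at +/-127
    return q.astype(np.int8), scale

w = np.array([-2.5, -0.3, 0.0, 0.4, 6.0], dtype=np.float32)
q_max, s_max = quantize_int8(w)                 # scale from |max| = 6.0
q_thr, s_thr = quantize_int8(w, threshold=2.5)  # the outlier 6.0 saturates to 127
```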
During the execution of a graph in deep learning systems, similar calculations are often repeated. TensorRT optimizes GPU memory and bandwidth through layer and tensor fusion, which merges nodes in a kernel both vertically and horizontally. This reduces the overhead and cost of reading and writing tensor data for each layer; an analogy is making a single trip to purchase three products instead of three separate trips. For instance, TensorRT can combine layers with the same input and filter size but different weights into a single wider 1x1 CBR layer, as shown in figure 2.34.
During the optimization phase, kernel-specific enhancements are applied to the model, with the optimal layers, algorithms, and batch sizes selected according to the target GPU platform. For instance, for convolution operations, TensorRT automatically identifies the most efficient implementation for the designated platform.
Figure 2.34 GoogLeNet's Inception module graph is reduced in compute and memory cost because of TensorRT's vertical and horizontal layers integration
TensorRT reduces memory consumption by allocating memory to a tensor only for the duration of its usage. This allows faster and more efficient execution, since it minimizes memory use and allocation overhead.
TensorRT was also built from the ground up to process multiple input streams in parallel, essentially using NVIDIA's CUDA streams.
LiDAR
The article outlines three fundamental processes: data pre-processing, breakpoint detection, and line extraction. Data pre-processing organizes the low-level data and corrects sensor bias. Breakpoint detection identifies continuous measurement sequences across changes of the scanned surface, and the implementation and tuning of the detectors is also examined. Line extraction applies kernels to each continuous scan sequence within a range image.
The Adaptive Breakpoint Detector (ABD) clusters the 2D points, as illustrated in figure 2.35.
To cluster the scan points, the Euclidean distance between two consecutive points \( p_n \) and \( p_{n-1} \) is determined. If their distance is less than the threshold \( D_{max} \), they belong to the same cluster, as seen in equation (2.20):

\[ \| p_n - p_{n-1} \| < D_{max} \tag{2.20} \]

Raw 2D LiDAR data becomes increasingly sparse with distance from objects, so the maximum distance \( D_{max} \) must be adjusted according to how close the LiDAR is to its surroundings. To keep the clustering accurate, \( D_{max} \) is therefore made adaptive. Geovany established an adaptive threshold based on the scanning range \( r_{n-1} \), which improves the robustness of the LiDAR measurements.

For the \( (n-1) \)-th scan point, a virtual line passing through \( p_{n-1} \) at a predefined angle \( \lambda \) with respect to the scanning direction is used to determine the maximum permissible range. Applying the law of sines to the resulting triangle gives the estimate of \( D_{max} \):

\[ D_{max} = r_{n-1} \cdot \frac{\sin(\Delta\phi)}{\sin(\lambda - \Delta\phi)} \tag{2.21} \]

Figure 2.35 An adaptive Breakpoint Detection algorithm

where \( \Delta\phi \) is the angle formed between the two consecutive points \( p_n \) and \( p_{n-1} \), and \( r_{n-1} \) is the range of the point \( p_{n-1} \) measured from the sensor origin.

The distance between the two consecutive points of the triangle is obtained with the cosine rule, as illustrated in equation (2.22):

\[ \| p_n - p_{n-1} \| = \sqrt{r_n^2 + r_{n-1}^2 - 2\, r_n\, r_{n-1} \cos(\Delta\phi)} \tag{2.22} \]

If the distance between the points \( p_n \) and \( p_{n-1} \) is higher than \( D_{max} \) from equation (2.21), the point \( p_n \) is treated as a breakpoint and starts a new cluster, as seen in equation (2.23):

\[ \| p_n - p_{n-1} \| > D_{max} \tag{2.23} \]
Breakpoint detectors are essential tools for pinpointing specific areas of interest, and two simple detectors are highlighted in this research. The article presents qualitative and quantitative comparisons using simulated and real data, along with an assessment of the framework's effectiveness when alternative methods for identifying breakpoints and lines are used.
We apply this technique in the autonomous robot system to cluster the different objects on the road and elsewhere, as illustrated in figure 2.36.
The ABD algorithm was applied to the autonomous robot system as shown in Figure 2.36: part (a) shows the system operating in a real environment with three distinct objects, while part (b) displays the output of the ABD technique, where green indicates object 1, yellow represents object 2, purple denotes object 3, and pink signifies the wall.
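A minimal sketch of the adaptive breakpoint detector applied to one ordered 2D LiDAR scan (ranges and angles), following equations (2.21)-(2.23); the auxiliary angle lambda and the way the scan is passed in are assumptions, not the thesis implementation.

```python
import numpy as np

def adaptive_breakpoint_clusters(ranges, angles, lam=np.radians(10)):
    """Split an ordered 2D scan into clusters of point indices using the adaptive threshold D_max."""
    clusters, current = [], [0]
    for n in range(1, len(ranges)):
        dphi = angles[n] - angles[n - 1]
        # Adaptive threshold from the law of sines (equation 2.21)
        d_max = ranges[n - 1] * np.sin(dphi) / np.sin(lam - dphi)
        # Euclidean distance between consecutive points via the cosine rule (equation 2.22)
        d = np.sqrt(ranges[n] ** 2 + ranges[n - 1] ** 2
                    - 2 * ranges[n] * ranges[n - 1] * np.cos(dphi))
        if d > d_max:            # breakpoint: start a new cluster (equation 2.23)
            clusters.append(current)
            current = [n]
        else:
            current.append(n)
    clusters.append(current)
    return clusters
```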
HARDWARE PLATFORM
OVERVIEW SYSTEM
This section describes the roles of each hardware component and their relation to one another. The entire hardware system is depicted in Figure 3.1.
The Desert Buggy is a high-speed, 1/10-scale remote-control automobile that is ready to use immediately, as shown in Figure 3.2. It operates on 2.4 GHz radio signals with a range of up to 150 meters and features full steering and forward/reverse control. We selected the Desert Buggy for its excellent controllability and car-like steering mechanism; detailed specifications are provided in Table 3.1.
Table 3.1 Specifications of the car
Figure 3.1 The overall hardware platform
3.1.2 16 Channel PWM controller circuit PCA9685
The PCA9685 is a 16-channel, I2C-bus controlled LED driver optimized for RGBA color backlighting applications. Each LED output is managed by its own 12-bit resolution PWM controller, allowing 4096 steps of control. The duty cycle of each channel can be set anywhere from 0 to 100 percent, providing precise control over the brightness and color output.
The 16-channel 12-bit PWM servo driver, built around the PCA9685 chip, enables precise control of LED brightness or servo position and operates at a configurable frequency ranging from 24 Hz to 1526 Hz. The driver requires only 2 pins to manage 16 servos, greatly minimizing the necessary I/O connections. Furthermore, it can be cascaded with up to 62 additional driver boards, allowing a total of 992 servos to be controlled. Detailed specifications of the PCA9685 are provided in Table 3.2.
The module uses the PCA9685 chip as its controller, enabling the regulation of 16 channels of PWM output. By configuring this controller, we can precisely adjust the PWM frequency and duty cycle for accurate servo control. The LEDn_ON and LEDn_OFF registers manage each LED driver's output.
Car battery: 7.4 V 2200 mAh LiPo battery
Transmitter battery: 4 × AA batteries (not included)
Figure 3.3 16 Channel PWM controller circuit PCA9685
The on time and the PWM duty cycle of each output may be individually regulated. With the LED_ON time set to 409 and the LED_OFF time set to 1228, as illustrated in figure 3.4, the PWM duty cycle is (1228 - 409) / 4096 ≈ 20%.
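A small sketch of the register arithmetic above, converting a desired delay and duty cycle into 12-bit LED_ON/LED_OFF counts; the 10% delay / 20% duty example reproduces the 409 and 1228 values quoted in the text, and the helper is an illustration rather than driver code.

```python
def pca9685_counts(delay_pct, duty_pct, resolution=4096):
    """Return (LED_ON, LED_OFF) 12-bit counts for a given delay and duty cycle."""
    led_on = round(delay_pct / 100.0 * resolution) - 1
    led_off = led_on + round(duty_pct / 100.0 * resolution)
    return led_on, led_off

print(pca9685_counts(10, 20))  # (409, 1228): 10% delay, 20% duty cycle
```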
Table 3.2 Specifications of 16 Channel PWM controller circuit PCA9685
Communication: I2C (accepts 3~5 VDC TTL logic level)
A radio-frequency module is a compact electronic device that transmits and receives radio signals between two devices, as shown in figure 3.5. This wireless connectivity is essential for embedded systems, and these modules come in various shapes and sizes. The transmitter captures stick inputs and wirelessly sends them to the receiver, which relays the information to the RC robot, enabling it to move accordingly. The radio module features four distinct channels, one for each stick direction, along with additional channels for auxiliary switches. Detailed specifications for the Devo7 transmitter and the RX701 receiver are provided in tables 3.3 and 3.4.
Figure 3.4 The operation of LED control by PCA9685
Figure 3.5 Devo7 transmitter (left) and the receiver (right)
Table 3.3 Specifications of Devo7 transmitter
- Used for manual control mode, Devo7 is the transmitter with three channels
- The maximum distance can reach 80m
- RX701 is used to send and receive the signal
SLAMTEC has developed the RPLIDAR A1, an affordable 360-degree 2D laser scanner (LIDAR). The device can perform a full scan within a range of 12 meters (6 meters for the A1M8-R4 and its variants), generating 2D point cloud data suitable for mapping, localization, and object/environment modeling. The RPLIDAR A1 captures about 1450 points per rotation at a scanning frequency of 5.5 Hz, as illustrated in figure 3.6.
Encoder: ARM microcomputer system
Battery: 1.2 V × 8 NiCd or 1.5 V × 8 AA dry batteries
Figure 3.6 RP-Lidar A1 system composition
The RPLIDAR A1 is a laser triangulation measurement device with a maximum scanning frequency of 10 Hz, capable of operating effectively in both indoor and outdoor environments, provided it is not directly exposed to sunlight.
The RPLIDAR A1 features a motor system and a range scanning system that activate before it begins rotating and scanning in a clockwise direction. Users can access the range scan data via a communication link, such as a serial port or USB. The device includes a speed detection and adaptive system that automatically adjusts the laser scanner's frequency according to the motor speed, and the host system can receive the speed data in real time. Additionally, the straightforward power supply design lowers the bill of materials (BOM) cost and makes the RPLIDAR A1 easier to operate. Detailed specifications for power and communication interfaces are provided in the following sections.
The RPLIDAR A1 sensor delivers distance measurements more than 800 times per second by emitting a modulated infrared laser signal that reflects off target objects. The vision acquisition system samples the returning signal, while the integrated DSP processes the data to determine the distance and angle to the object and transmits this information via the communication link. Detailed specifications for the RPLIDAR are provided in Table 3.5.
Figure 3.7 The RPLIDAR A1 working schematic
- Laser scans are used to collect point clouds in 2D space
- The purpose of using LiDAR is to identify the appropriate space barrier
The IMX335 camera offers a portable and lightweight solution, connecting to the Jetson TX2 easily via USB. Capturing 30 frames per second, it delivers a smooth video stream with vibrant colors and good contrast. The camera records in Full HD, as shown in Figure 3.8.
Table 3.6 Specifications of IMX335 5MP USB Camera (A)
Focal length (EFL): 3.91 mm
Field of view (FOV): 106° (D), 92.6° (H), 48.6° (V)
Image format support: MJPG, YUY2
Supported operating systems: Windows, Linux, macOS

Table 3.5 Specifications of RPLIDAR A1
Output: UART serial (3.3 V logic level)
Range resolution: ≤1% of the range (≤12 m); 1% of the range (≤3 m), 2% of the range (3–5 m), 2.5% of the range (5–25 m)

Figure 3.8 IMX335 5MP USB Camera (A)
- Acquire RGB images for the processor to use in deploying Computer Vision tasks
The Jetson TX2 is a powerful supercomputer-on-a-module that simplifies the design of AI-driven hardware and software, as shown in figure 3.9. With its 256-core NVIDIA Pascal architecture GPU and its quad-core ARM Cortex-A57 plus dual-core Denver 2 CPU complex, it offers high AI processing power, making it an ideal choice for intelligent edge devices like robots, drones, smart cameras, and portable medical equipment. Furthermore, it is a cost-effective solution for developers.
The SDK offers a comprehensive suite for developers, featuring the BSP, CUDA, cuDNN, and TensorRT within NVIDIA JetPack. It supports a variety of AI frameworks, including TensorFlow, PyTorch, Caffe/Caffe2, Keras, and MXNet, making it a versatile tool for AI development.
The Jetson TX2 offers twice the performance and energy efficiency of the Jetson TX1, making it suitable for smart city applications, manufacturing robots, and prototyping. It is compatible with Jetson TX1 modules and supports larger, deeper neural networks. Detailed specifications for the Jetson TX2 can be found in Table 3.7.
Table 3.7 Specifications of Jetson TX2 [18]
- The Jetson TX2 is a potent central processor capable of handling data collection, deep learning, and AI applications
- The Jetson TX2 is employed in our system to handle camera and LiDAR information
The Arduino Uno is the most popular Arduino model due to its affordability and numerous I/O ports. It includes 20 digital input/output pins, of which 6 are capable of PWM output and 6 can serve as analog inputs. Additionally, it features a 16 MHz resonator, a USB connection, a power jack, an ICSP header for in-circuit programming, and a reset button.
CPU: Dual-core NVIDIA Denver 2 64-bit CPU + quad-core ARM Cortex-A57 MPCore
GPU: 256-core NVIDIA Pascal GPU architecture with 256 CUDA cores
Memory: 8 GB 128-bit LPDDR4, 1866 MHz, 59.7 GB/s
Camera: Up to 6 cameras (12 via virtual channels); 12 lanes MIPI CSI-2, D-PHY 1.2 (up to 30 Gbps), C-PHY 1.1 (up to 41 Gbps)
Video encoder: 1x 4Kp60, 3x 4Kp30, 4x 1080p60, 8x 1080p30 (H.265)
Video decoder: 1x 4Kp60, 3x 4Kp30, 7x 1080p60, 14x 1080p30 (H.264)
Display: 2 multi-mode DP 1.2 / eDP 1.4 / HDMI 2.0; 2 x4 DSI (1.5 Gbps/lane)
Networking: 10/100/1000 BASE-T Ethernet, WLAN
The ATmega328-based Arduino Uno is a microcontroller board, shown in figure 3.10. The specifications of the Arduino Uno R3 are given in table 3.8.
Table 3.8 Specifications of Arduino Uno R3
A lithium-polymer (LiPo) battery is a rechargeable power source that uses a solid polymer electrolyte and lithium as one of its electrodes. Commercially available LiPo batteries are typically hybrids, incorporating a gel polymer or liquid electrolyte within a pouch, and are more accurately referred to as lithium-ion polymer batteries. In this study, we use two batteries, the Gaoneng GNB 7.4V 5500mAh 70C 2S LiPo battery and the Lipo Tiger 3S 11.1V 3500mAh 30C, which together power all of the systems, as shown in figures 3.11 and 3.12. The specifications for these batteries are detailed in tables 3.9 and 3.10.
Maximum current on a single I/O pin: 40 mA
DC current per I/O pin: 40 mA
DC current on the 3.3 V pin: 50 mA
Figure 3.10 Pinout of Arduino Uno R3
Table 3.9 Specifications of Lipo Tiger 3s 11.1V 3500mah 30C
Table 3.10 Specifications of Gaoneng GNB 7.4V 5500mAh
Dimensions: 130 mm × 45 mm × 25 mm (L × W × H)
Dimensions: 134 mm × 44 mm × 19 mm (L × W × H)
Figure 3.11 Gaoneng GNB 7.4V 5500mAh 70C 2S Lipo Battery
- The Lipo Tiger 3S 11.1V 3500mAh battery supplies power for the Jetson TX2
- The Gaoneng GNB 7.4V 5500mAh battery supplies power for the motor, the PCA9685, and the servo
SOFTWARE DESIGN
Object Detection Algorithm
YOLOv4 is recognized as one of the most efficient iterations of the YOLO series, offering better speed and accuracy than its predecessors. The advancements detailed in the YOLOv4 paper [12] include updates to existing components and the introduction of new features.
Selecting the right Backbone is crucial for object detection performance, as its main function is to extract the essential features. Typically, pre-trained neural networks are employed as the Backbone; popular choices include VGG16 [19], ResNet-50 [20], SpineNet [21], EfficientNet-B0/B7 [22], CSPResNeXt50 [23], and CSPDarknet53 [23]. In [12], the authors select CSPDarknet53 based on studies comparing CSPDarknet53 with other backbones on several datasets.
Cross-Stage Partial (CSP) connections are the key feature of CSPNet, aimed at enhancing design resilience while minimizing processing demands. This is accomplished by splitting the features of the base layer into two segments and merging them again through the proposed cross-stage hierarchy, as shown in figure 4.2. Consequently, CSPNet not only decreases computational requirements but also improves inference accuracy. The Cross Stage Partial architecture was developed from the DenseNet architecture, which concatenates the previous input with the current input before entering the dense layer.
Figure 4.2 Illustration of (a) DenseNet and (b) Cross Stage Patial DenseNet (CSPDenseNet)
Darknet-53 is the backbone used for object detection in YOLOv3, featuring 53 convolutional layers enhanced by residual connections. As depicted in figure 4.3, YOLOv4 builds on Darknet-53 and adds CSP connections for improved feature extraction. However, as the number of layers increases, the context of the learned features diminishes, which is why researchers introduced skip connections to facilitate the backpropagation of gradients to the earlier layers.
DenseNet features skip connections between every pair of layers, but its large number of parameters complicates training and prediction. CSPResNeXt50 and CSPDarknet53 are built on DenseNet's design and enhance it by decoupling the feature map of the base layer: the feature map is duplicated, with one copy passing through the dense block and the other sent directly to the next stage. The primary aim of CSPResNeXt50 and CSPDarknet53 is to remove the computational bottlenecks present in DenseNet and to improve learning by passing on an unedited version of the feature map.
When the cross-stage partial connection was applied to ResNeXt-50, an improvement of roughly 20 percent was observed.
Figure 4.3 The structure of CSPDarknet53
The Neck collects features from the Backbone, enabling the extraction of rich semantic features for accurate predictions, with the receptive field acting as a crucial benchmark. It improves the recognition of objects of varying sizes and enriches the information passed to the Head.
Spatial Pyramid Pooling (SPP) enlarges the receptive field and extracts the most important features from the Backbone by placing SPP blocks after CSPDarknet53. The technique processes an input image through the convolutional layers to generate a feature map and then applies max pooling with several window sizes to create a combined feature set, as depicted in figure 4.4. Subsequently, YOLOv4 divides the features into depth-based segments, applies SPP to each segment, and merges them to produce the final output feature map.
Figure 4.4 Spatial Pyramid Pooling with 3 scales [24]
- Path Aggregation Network (PAN): PANet has the capacity to properly store spatial information, which assists in the accurate localization of pixels for mask creation
Figure 4.5 (a) illustrates bottom-up path augmentation: the complexity of features increases through the neural network layers while the spatial resolution decreases, leading to challenges in accurately identifying pixel-level masks. The Feature Pyramid Network (FPN) addresses this by utilizing a top-down path to extract and merge semantically rich features with precise localization. PANet adds an additional bottom-up route to the FPN's top-down approach and improves efficiency by incorporating clean lateral connections from lower to higher layers, known as "shortcuts," which typically span around 10 layers.
Adaptive feature pooling is illustrated in Figures 4.5 (b), (c), and (d), showing how PANet uses features from several layers to identify the most effective ones. The ROI Align operation is performed on each feature map to extract the object features, and the network learns the useful features through an element-wise max fusion. Additionally, figure 4.5 (e) depicts fully-connected fusion, which is sensitive to location and adapts to different spatial contexts. By integrating the information from these two layers, PANet improves the accuracy of mask predictions.
For mask predictions with adaptive feature pooling, PANet normally adds neighboring layers together; in YOLOv4, however, a concatenation operation is used on these layers instead of simple addition, as shown in figure 4.6, which leads to more accurate predictions.
Figure 4.6 Illustrations of (a) PAN and (b) modified PAN
Based on YOLOv4's technique, YOLOv4-tiny can recognize objects more quickly, giving embedded systems and mobile devices a better chance of deploying it successfully.
The YOLOv4-tiny approach uses the CSPDarknet53-tiny network as its backbone, replacing the standard CSPDarknet53 network. In this architecture, CSPBlock modules are implemented instead of ResBlock modules within the cross-stage partial network. The CSPBlock design divides a feature map into two sections, which are then merged through cross-stage residual edges, improving the correlation of gradient information by allowing the gradient to traverse two distinct network pathways.
Figure 4.7 The structure of YOLOv4-tiny
The YOLOv4-tiny technique uses a different feature fusion approach from the YOLOv4 method, which relies on spatial pyramid pooling and path aggregation networks. By incorporating feature maps of sizes 13x13 and 26x26, YOLOv4-tiny effectively improves the accuracy of its detection outcomes.
Road Lane Detection Algorithm
Lane detection systems face two main challenges: complex settings and the need for high speed. Effective lane detection, especially under severe occlusion and extreme lighting, relies on global and contextual information. A recent publication introduced a formulation that addresses these challenges with fast processing in complicated scenarios. By leveraging global features, the computational cost of the row-based selection problem is kept low. The study also proposes a structural loss to explicitly model the lane structure. Figure 4.8 provides an overview of the UFLD architecture.
Figure 4.8 The overall of Lane- line detection algorithm [26]
The three main points of this work are as follows:
- A new lane detection formulation addresses speed and the absence of visual cues
- The suggested formulation shows a structural loss that explicitly incorporates prior information on lanes
- The suggested method accomplishes state-of-the-art performance on the complex CULane dataset in terms of accuracy and speed
This study presents a row-based selection method that uses global image features for lane recognition. The approach identifies suitable lane positions on predefined rows by leveraging these global features. Lanes are represented as a series of horizontal positions at predefined rows, known as row anchors, and each row anchor is divided into a number of gridding cells, so that locating a lane amounts to selecting one cell per row. The formulation for determining the lane locations can be written as

\[ P_{a,b,:} = f_{a,b}(X) \]

where:
W: the width of an image
H: the height of an image
\( f_{a,b} \): the classifier used to determine the lane placement of the a-th lane on the b-th row anchor
\( P_{a,b,:} \): the (w+1)-dimensional vector representing the probabilities of selecting each of the (w+1) gridding cells for the a-th lane and the b-th row anchor
\( T_{a,b,:} \): the one-hot label of the correct location
X: the global features of an image
They formulated the row-based selection through a cross-entropy loss, as described by

\[ L_{cls} = \sum_{a=1}^{N} \sum_{b=1}^{h} L_{CE}\left( P_{a,b,:},\, T_{a,b,:} \right) \]

The loss compares the predictions \( P_{a,b,:} \) with the one-hot labels \( T_{a,b,:} \). Here, \( N \) denotes the number of lane lines, while \( h \) indicates the number of predefined row anchors.
The difference between the estimates and the labels is measured with this cross-entropy cost, which, together with the structural and segmentation losses, forms the UFLD objective. The structural loss ties all of the predefined rows together: Zequn used a lane similarity loss to enforce a close relationship among neighboring row anchors. Because lane locations are expressed by classification vectors, constraining the distribution of the classification vectors over consecutive row anchors preserves the continuity of the lanes:

\[ L_{sim} = \sum_{a=1}^{N} \sum_{b=1}^{h-1} \left\| P_{a,b,:} - P_{a,b+1,:} \right\|_1 \]

In the original paper, the author also introduced a "shape loss", a structural loss that relates several pre-defined row anchors. This localization-based loss offers two key benefits: the expectation function is differentiable, and it lets the model relate the discrete rows in a continuous way. The shape loss can be expressed mathematically as follows.
\[ L_{shp} = \sum_{a=1}^{N} \sum_{b=1}^{h-2} \left\| \left( Loc_{a,b} - Loc_{a,b+1} \right) - \left( Loc_{a,b+1} - Loc_{a,b+2} \right) \right\|_1 \tag{4.6} \]

where \( Loc_{a,b} = \sum_{k=1}^{w} k \cdot Prob_{a,b,k} \) is the expected location on the a-th lane at the b-th row anchor, and \( Prob_{a,b,k} \) is the probability of the a-th lane, the b-th row anchor, and the k-th location.
The final structural loss combines \( L_{sim} \) and \( L_{shp} \) with a coefficient \( \lambda \), as illustrated in equation (4.7):

\[ L_{str} = L_{sim} + \lambda L_{shp} \tag{4.7} \]
The model also incorporates an auxiliary segmentation branch that aggregates global features from multiple scales of the backbone feature maps. The overall loss of the model is the sum of three losses,

\[ L_{total} = L_{cls} + \alpha L_{str} + \beta L_{seg} \]

where \( L_{seg} \) denotes the cross-entropy loss of the auxiliary segmentation branch and \( \alpha \), \( \beta \) are loss coefficients.
UFLD's robustness has been validated with an emphasis on real-world performance rather than simulations. With the structural loss and the segmentation assistance, the model runs at least four times faster than previous state-of-the-art methods while delivering comparable results.
Design of the Steering Controller
A set of coordinates represents the lane lines detected on the road; at most three lane lines are detected on our campus, denoted \( l_1, l_2, l_3 \) from left to right. The leftmost lane line is ignored so that the robot keeps to the right-hand lane, in compliance with traffic law, while navigating to the next destination.
The two remaining lane lines are used to determine the steering angle. First, the average of the two sets of coordinates \( l_2 \) and \( l_3 \) is computed to obtain the center of the right lane, which serves as the predicted position of the next state:

\[ x_{mid,b} = \frac{x_{2,b} + x_{3,b}}{2}, \quad b = 1, \dots, R \]

Here N represents the number of lane lines, while R denotes the number of predefined rows. The steering angle is then derived from the calculated midway point. However, the geometry imposes certain limitations on the steering angle, and steering directly toward the midpoint can produce a noisy command.
Instead, a PID controller is used to track the desired steering angle smoothly.
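A minimal sketch combining the lane-midpoint computation with a PID correction of the steering command, reusing the PID class from the earlier sketch; the image width, lane coordinates, gains, and sign convention are placeholders rather than the values used on the robot.

```python
import numpy as np

def lane_midpoint(lane2_x, lane3_x):
    """Average the x coordinates of the two right-most lane lines at each row anchor."""
    return (np.asarray(lane2_x, dtype=float) + np.asarray(lane3_x, dtype=float)) / 2.0

def steering_offset(mid_x, img_width=640):
    """Signed offset of the predicted path from the image center, normalized to [-1, 1]."""
    target = np.mean(mid_x)                     # mean midpoint over the row anchors
    return (target - img_width / 2.0) / (img_width / 2.0)

# Example with made-up lane coordinates; the PID class from section 2.4's sketch is reused
lane2 = [300, 305, 310, 318]   # x positions of lane l2 at four row anchors
lane3 = [420, 428, 436, 446]   # x positions of lane l3 at the same row anchors
offset = steering_offset(lane_midpoint(lane2, lane3))
pid = PID(kp=25.0, ki=0.0, kd=3.0, dt=0.05)    # illustrative gains only
steer_cmd = pid.update(setpoint=0.0, measurement=-offset)  # drive the offset toward zero
```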