INTRODUCTION
Proposal
The concept of autonomous vehicles dates back to the mid-20th century, when it was initially dismissed as unrealistic and relegated to the realm of fiction. In 1958, Disney showcased this vision in a show that depicted highways populated by self-driving cars guided by colored bands and coded statuses on punch cards. It wasn't until the 1980s, however, that significant advancements in processing technology began to turn this long-held dream into a reality.
The development of autonomous vehicles relies heavily on advanced technologies, including deep learning for terrain analysis and trajectory planning, achieved through the integration of multiple sensors that create a comprehensive view of the environment. A notable approach uses a multi-camera system that provides a 360-degree field of view for 3D mapping and object detection, as discussed in [1]. To enhance vehicle position tracking accuracy, the Extended Kalman Filter (EKF) has been employed with radar and LiDAR sensors, as demonstrated in [2]. Similar methodologies, such as the fusion of odometry and LiDAR for the "turtlebot" robot, also incorporate the EKF [3]. However, the high cost of LiDAR technology poses challenges for student projects, prompting researchers to explore more affordable solutions, such as camera fusion combined with GPS and IMU sensors, as outlined in [4].
This thesis presents a solution for a small-scale self-driving car model, utilizing an RC car and multiple sensors while minimizing computational load. Titled "Design and Implementation of Lane Keeping and Navigation System for Self-Driving Vehicle Based on the Fusion of Camera, GPS, and IMU," the study focuses on integrating these technologies to enhance navigation and lane-keeping capabilities in autonomous vehicles.
Research Objective
The primary aim of this research is to develop an autonomous vehicle that operates effectively on the HCMUTE school campus. Our group's objective is to construct suitable hardware and research methods for integration into this system. This initiative lays the groundwork for future efforts to implement a full-scale self-driving car on public roads.
Limitation
Because this thesis lays the foundation for future work, there exist some notable limitations:
- The model cannot operate on actual roads, as the camera angle is restricted
- The results of this thesis are experimental. Therefore, our car model can only operate within the confines of the HCMUTE campus
- The model does not run well in extreme weather or lighting conditions: rain, intense glare, absolute darkness, etc.
- Sensors such as GPS and IMU experience large noise in outdoor conditions
- The method needs to be lightweight enough to run on our hardware in real time. Therefore, some more precise methods are not used in this thesis.
Research Content
The research and implementation of this project consists of the following:
- Content 1: Research the theoretical and mathematical background related to self-driving cars to give direction and handle problems during model building
- Content 2: Research the effective utilization of essential components, including sensors, microcontrollers, and motor types, with a specific focus on the NVIDIA Jetson TX2 board; the system incorporates a camera, an Arduino Uno R3, the integrated GPS of a smartphone, and various other microcontrollers to enhance functionality and performance
- Content 3: Research using the BiSeNet network on photos taken from real terrain, including paths and pedestrians
- Content 4: Focus on the automatic control algorithm that helps the car model perform the proposed algorithm
- Content 5: Build and learn how to use the Arduino Uno with the NVIDIA Jetson TX2 to receive control signals for the motor and servo from the DEVO7 transmitter
- Content 6: Learn how to control the servo with the GPIO I2C pins of the NVIDIA Jetson
- Content 7: Research using GPS to localize the car's position on the HCMUTE school campus and assist in turning at forks
- Content 8: Apply the Kalman Filter to filter GPS noise and estimate the state of the autonomous vehicle.
Thesis Summary
This thesis is contributed by the following chapters:
Chapter 1: Introduction: This chapter presents the problem statement, objectives, research content, and research scope.
Chapter 2: Theoretical Basis: In this chapter, the theories related to the project that are used to design and construct the model are presented.
Chapter 3: Design and Calculation: This chapter provides an overview of the topic's requirements as well as the calculations and design, including sections such as the system block diagram design, circuit diagram, and mathematical algorithms.
Chapter 4: Hardware Assembly: Presents the mechanical and electrical work that the group of students performed for the self-driving car model, the program written for the system, and the model simulation.
Chapter 5: Experimental Results: Presents the results obtained in comparison to the objectives established during the research process, then analyzes and compares simulation and real-world findings.
Chapter 6: Conclusions and Future Works: This chapter discusses the project's outcomes and limitations, from which conclusions and development directions are drawn to address the unresolved difficulties and conclude the project.
THEORETICAL BASIS
Technologies used in Self-driving Car
Cameras have become essential components of modern vehicles, with most cars now featuring reverse cameras equipped with curbside monitoring software to assist with parking. Vehicles with lane departure warning systems utilize forward-facing cameras to identify road markings, a feature also found in self-driving cars. Additionally, many recently developed vehicles include cameras for lane detection, while multi-function systems often offer 360-degree cameras for a comprehensive view of the surroundings. Furthermore, object detection and recognition programs effectively utilize camera images, enabling integration with AI algorithms for enhanced recognition capabilities.
Figure 2.1 Camera on self-driving car [5]
Despite its innovative design, the system faces significant challenges. The camera, mimicking human vision, struggles in extreme light conditions, which can disrupt data collection and potentially cause program failures. Additionally, the risks associated with high-speed unmanned vehicles pose threats to surrounding traffic. Moreover, integrating multiple cameras to minimize blind spots generates a substantial volume of data that must be processed in real time, placing considerable demands on the computing hardware.
GPS is a satellite-based navigation system originally developed by the US Department of Defense in the 1970s for military use, which has since become essential for civilian applications. It provides users with accurate location information anywhere on Earth, though it operates on a one-way communication model, allowing only signal reception due to government regulations limiting unauthorized access. GPS technology is commonly integrated into moving vehicles, including self-driving cars, to guide navigation and transmit the vehicle's location to monitoring servers.
Figure 2.2 GPS tracking for Ground vehicles [7]
Figure 2.3 IMU sensor: a) IMU sensor based on three types of sensors; b) IMU sensor on a plane; c) IMU sensor in a mobile phone
An Inertial Measurement Unit (IMU) is an electronic device that combines accelerometers, gyroscopes, and magnetometers to measure and report a body's specific force, angular rate, and sometimes orientation. IMUs are widely used in modern vehicles, including motorcycles, missiles, airplanes, unmanned aerial vehicles (UAVs), and spacecraft such as satellites and landers. Recent technological advancements have led to the development of IMU-enabled GPS systems, which allow GPS receivers to estimate their position when GPS signals are unavailable, such as in tunnels or buildings, by utilizing the various integrated sensors within the IMU.
2.1.4 Deep Learning Applications in Self-driving Car
Self-driving is a prominent area of research in Artificial Intelligence, particularly within Deep Learning, with numerous publications emerging annually. Researchers focus on applying Convolutional Neural Networks (CNN) to self-driving cars through two main approaches. The first approach is End-to-End Learning, where the CNN entirely controls the vehicle by predicting the steering angle using a Fully Connected Network.
The end-to-end learning method for self-driving cars processes input images through a series of convolutional layers that extract essential features. These features are then flattened and passed to a fully connected network, which generates the output control signal for vehicle operation.
The end-to-end method for steering angle control, while simple, has significant drawbacks due to the difficulty of managing output stability, as interference in the steering angle can disrupt the input and output layers of the CNN. In contrast, the Multi-task Learning approach, which has gained popularity in recent years, improves robustness by dividing the workload among multiple components; for instance, one CNN identifies the path contour while another algorithm computes the steering angle based on these coordinates. This division of tasks improves the algorithm's resilience against interference from higher-order controllers.
- Road Scene Understanding (Semantic Segmentation): Semantic image segmentation plays a major role in road scene understanding, which is crucial to the perception system of self-driving cars. An image segmentation network is used to identify and segment objects such as cars, pedestrians, roads, traffic signs, lane markings, and buildings from input images. This segmentation is essential for developing control algorithms and navigation strategies that enable autonomous vehicles to operate effectively in urban environments.
Figure 2.5 Image segmentation for road scene understanding. The above figure demonstrates the segmentation process using an Encoder-Decoder based Convolutional Network.
- Object Detection: The reliability and portability of new one-stage object detection algorithms such as YOLO (and of region-based detectors such as Fast-RCNN) significantly enhance processing efficiency by focusing on identifying objects within a defined area, known as a bounding box. This approach contrasts with semantic segmentation, which requires learning and mapping every pixel to a specific label, making detection algorithms more streamlined for object localization tasks.
Figure 2.6 Object Detection in Self-driving applications
Figure 2.7 Example of a one-stage detector – the YOLO architecture [8]. The DarkNet architecture acts as a feature extractor.
Convolutional Neural Network
Convolutional Neural Networks (CNN) are advanced deep learning algorithms designed to process image inputs by assigning learnable weights and biases to various objects within those images, enabling effective differentiation between them. The preprocessing required for CNNs is significantly less than for other methods. A standard CNN architecture includes a convolutional layer, an activation function, a pooling layer, and a fully connected layer. The foundational principles of CNNs have also paved the way for various other image-related applications of neural networks, including image segmentation.
Figure 2.8 CNN architecture for handwriting classification [10]
The convolutional layer performs the convolution operation, combining two functions to create a third function that illustrates how one influences the other. The input image data and a matrix of user-defined weights, known as a filter or kernel, are the two functions involved. Through element-wise multiplication and summation, these functions are combined to produce an output. Common filter sizes in Convolutional Neural Networks (CNNs) are odd integers, such as 3x3 or 5x5, with the example in Figure 2.9 utilizing a 3x3 matrix.
The filter must always be smaller than the input image and starts in the upper left corner, moving to the bottom right. As it traverses the image, the filter multiplies its values element-wise with the underlying input values, summing the results to produce a scalar value with each pass. Repeated application of the filter across the input array yields a two-dimensional output, commonly referred to as a "feature map."
Using a smaller filter than the input image is intentional, as it enables the filter to be applied multiple times at various locations across the input. Specifically, the filter is systematically applied from left to right and top to bottom, aligning with the input's spatial dimensions.
The detection filter is designed to identify specific features within an input image, allowing it to recognize that feature at any location throughout the picture, a concept referred to as translational symmetry.
Figure 2.9 Convolution of input image size 28x28 and kernel size 3x3 [11]
Equation 2.1 depicts the mathematical formula for convolution:

$$ (I * K)(i, j) = \sum_{m=0}^{h-1} \sum_{n=0}^{w-1} I(i+m,\, j+n)\, K(m, n) \tag{2.1} $$

The input image is I, and the filter is K with size h x w.
Equation 2.2 shows the size of the feature map after convolution of the filter with the input picture:

$$ n = \frac{I_h - h + 2p}{s} + 1, \qquad m = \frac{I_w - w + 2p}{s} + 1 \tag{2.2} $$

where n, m is the feature map size, h, w is the filter size, p is the padding size, and s is the stride distance.
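To make the operation concrete, the following is a minimal NumPy sketch of the convolution and feature-map-size computation described above; the function and variable names are illustrative rather than part of the thesis implementation.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Single-channel 2-D convolution (cross-correlation) as described by Equations 2.1 and 2.2."""
    if padding > 0:
        image = np.pad(image, padding, mode="constant")
    ih, iw = image.shape
    kh, kw = kernel.shape
    out_h = (ih - kh) // stride + 1          # Equation 2.2 (padding already applied)
    out_w = (iw - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)   # element-wise multiply and sum
    return feature_map

# Example: a 28x28 input convolved with a 3x3 kernel gives a 26x26 feature map (Figure 2.9)
img = np.random.rand(28, 28)
kernel = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
print(conv2d(img, kernel).shape)   # (26, 26)
```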
Activation functions are essential mathematical equations that dictate the behavior of neural networks by determining the activation of each neuron based on its significance to the model's predictions. They play a crucial role in filtering out high-value noise and transforming linear networks into nonlinear ones. Furthermore, activation functions normalize the output values of each neuron to specified ranges, such as 0 to 1 or -1 to 1, enhancing the network's overall performance.
In neural networks, an activation function layer follows each convolutional layer to introduce non-linearity, which is essential for learning complex mapping functions from data. Without an activation function, the network operates linearly, akin to a simple linear regression model, which significantly limits its learning capacity and representational power. Popular activation functions include Sigmoid, Tanh, and ReLU, each contributing to the network's ability to solve intricate problems.
Figure 2.10 Example of ReLU [11] activation function applied on image size 26x26
Stride refers to the number of pixels a filter moves when traversing the input image matrix during convolution. A stride of one means the filter shifts one pixel right and one pixel down, while a stride of two indicates a movement of two pixels. The output matrix size may be smaller than the input image size unless padding is applied, which can help maintain the original dimensions.
Pooling is classified into two types: Max Pooling and Average Pooling. Max Pooling returns the maximum value from the image's pixel regions covered by the kernel, while Average Pooling returns the mean of all pixels within the kernel. Max Pooling focuses on extracting borders by selecting the maximum values, whereas Average Pooling captures softer features, making it essential for users to choose the appropriate pooling type based on their specific data requirements.
Pooling is often performed in CNNs with a 2x2 window, a stride of two, and no padding. In the image segmentation network, the research team favors the use of Max Pooling.
Figure 2.11 Max Pooling 2x2 on input image 26x26 creates output 13x13 [11]
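A minimal sketch of the 2x2 max pooling operation of Figure 2.11 is given below; the helper name is illustrative, and Average Pooling would differ only in the reduction used.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """2x2 max pooling with stride 2, as used in the segmentation network."""
    h, w = feature_map.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()          # Average Pooling would use window.mean()
    return out

print(max_pool2d(np.random.rand(26, 26)).shape)   # (13, 13), matching Figure 2.11
```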
To aggregate multi-level features after processing an image through the hidden layers, the Fully Connected Layer, or connection layer, serves as an efficient method. The two-dimensional feature maps are flattened into a column vector, which is propagated forward through the final layers of the network, with errors backpropagated during training. The model utilizes a SoftMax or Sigmoid function to differentiate between significant and less important features for classification, operating over a set number of training sessions or epochs, with the SoftMax function specifically designed for categorizing multiple classes.
Image Segmentation Techniques
Image segmentation enables computers, software, or robots to analyze images by dividing them into distinct regions based on labeled training data. While it shares the same objective as object detection—identifying and labeling objects—image segmentation is more complex, as it requires the accurate labeling of each individual pixel within a designated region.
Image segmentation can be categorized into two types: Semantic Segmentation and Instance Segmentation. Semantic Segmentation classifies each pixel in an image into a specific class, such as grouping all human figures and the background into distinct classes. In contrast, Instance Segmentation assigns a unique instance to each pixel, allowing for the identification of multiple objects within the same category. For this project, we have chosen to implement Semantic Segmentation, as it is more efficient in terms of computational cost, classifying similar objects under a single class rather than detecting each individual instance.
Thresholding is the simplest form of image segmentation, creating binary or multi-color images by applying a threshold to the original pixel values. For instance, with a threshold set at 80, pixels below this value are assigned a value of 0 (black), while those above are set to 255 (white). There are five techniques associated with threshold-based segmentation.
Global thresholding is a technique that relies on bimodal images, which feature two distinct intensity peaks in their intensity distribution—one representing the object and the other the background. By identifying these peaks, a global threshold can be established for the image. However, this method tends to be less effective in low lighting conditions.
Manual thresholding involves selecting an initial threshold value to segment an image into two regions, G1 and G2. The means of each region are then calculated, and their average (T) is taken as the new threshold. This process is repeated, comparing the previous threshold (Ti-1) with the current threshold (Ti), until the error falls below a user-defined value.
Figure 2.12 Manual thresholding a) original image; b) Background image; c) Image after background subtraction; d) Threshold image [12]
Adaptive thresholding enhances image segmentation in poor lighting by dividing the image into smaller regions, each processed with a locally calculated threshold value. The segmented subregions are then combined to improve overall segmentation effectiveness, as illustrated in Figure 2.13.
• Optimal thresholding: The optimal thresholding approach can be used to reduce pixel misclassification caused by segmentation. The pixel value probability is calculated by the following equation:

$$ p(z) = P_b\,p_b(z) + P_o\,p_o(z) $$

• $p_b(z)$, $p_o(z)$: probability distributions of background and object pixels
• $P_b$, $P_o$: the a priori probabilities of background and object pixels
Figure 2.13 Real-time adaptive image thresholding [13]
Local adaptive thresholding addresses the challenges that varying illumination poses for global thresholding. By dividing the image into smaller subregions, adaptive thresholding is applied to each region individually. After segmenting these regions, they are combined to create a complete segmented image of the original. This method utilizes the histogram of each subregion to enhance the accuracy of image segmentation.
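The global and adaptive thresholding techniques above can be sketched with OpenCV as follows; the file names and the 80/31/5 parameter values are illustrative assumptions, not the values used in this thesis.

```python
import cv2

gray = cv2.imread("road.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image

# Global thresholding: pixels below 80 -> 0 (black), above -> 255 (white)
_, global_mask = cv2.threshold(gray, 80, 255, cv2.THRESH_BINARY)

# (Local) adaptive thresholding: a threshold is computed for each small neighbourhood
adaptive_mask = cv2.adaptiveThreshold(
    gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY,
    blockSize=31, C=5)   # 31x31 neighbourhood, constant offset 5

cv2.imwrite("global.png", global_mask)
cv2.imwrite("adaptive.png", adaptive_mask)
```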
Edge-based segmentation utilizes multiple techniques to identify the edges within an image, highlighting the separation between objects and their backgrounds Various edge detection operators are employed to effectively capture the contours of objects, as illustrated in the examples provided (Figure 2.14).
Figure 2.14 Different types of edges: a) Step edge; b) Roof edge; c) Ramp edge; d) Line edge
Figure 2.15 Types of edge detectors: a) Sobel; b) Prewitt; c) Roberts
Figure 2.16 An example of edge-based segmentation: a) Original image; b) Edge detection [14]
Clustering is an unsupervised machine learning technique widely employed for image segmentation, with K-Means Clustering being one of the most prominent algorithms in this domain.
The k-means algorithm works as follows:
1. Specify the number of clusters K.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e., the assignment of data points to clusters no longer changes:
o Compute the sum of the squared distances between data points and all centroids.
o Assign each data point to the closest cluster (centroid), i.e., the assignment weight is 1 for the nearest centroid and 0 otherwise.
o Compute the centroids of the clusters by taking the average of all data points that belong to each cluster.
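A short sketch of K-Means color segmentation using OpenCV's built-in kmeans routine is shown below; the file name and the choice K = 3 are illustrative.

```python
import cv2
import numpy as np

img = cv2.imread("scene.png")                       # hypothetical RGB image
pixels = img.reshape(-1, 3).astype(np.float32)      # each pixel is a 3-D data point

K = 3                                               # number of clusters (cf. Figure 2.17c)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, centroids = cv2.kmeans(
    pixels, K, None, criteria, attempts=10, flags=cv2.KMEANS_RANDOM_CENTERS)

# Replace every pixel by its cluster centroid to obtain the segmented image
segmented = centroids[labels.flatten()].reshape(img.shape).astype(np.uint8)
cv2.imwrite("segmented_k3.png", segmented)
```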
Figure 2.17 Segmentation using K-Mean Clustering applied on an RGB image with different k value a) Origin image; b) Segmented image with k=2; c) Segmented image with k=3
Figure 2.18 Example of K-Mean clustering method applied on 2D data
2.3.4 Artificial Neural Network Based Segmentation
2.3.4.1 Structure of an Image Segmentation Network
Figure 2.19 Image segmentation network according to the encoder/decoder structure [15]. The arrows display the operations of the network: red is downsampling, green is upsampling, and blue is skip connections.
Image segmentation networks utilize convolutional layers with same padding to maintain the original image size, allowing early layers to capture low-level features while deeper layers focus on high-level attributes. To enhance expressiveness, the number of channels is regularly increased, while the spatial resolution is managed through pooling or strided convolution. A popular architecture is the encoder/decoder structure, which downsamples the input image's spatial resolution to generate low-level feature maps. This approach has proven effective in distinguishing between objects; the features are subsequently up-sampled to create a full-resolution segmentation map.
Downsampling, primarily achieved through max pooling or average pooling layers, plays a crucial role in convolutional neural networks (CNNs) by reducing the dimensions of feature maps after convolution. This process helps preserve spatial information even when input images are distorted. While downsampling is not mandatory in CNNs, it significantly enhances network performance by generating more robust features.
Figure 2.20 Downsampling in convolution neural network
Figure 2.21 CNNs network with an upsampling layer [16]
Upsampling is a technique used to restore feature maps to the original image resolution, with transposed convolution being the most favored method due to its ability to perform learned upsampling. This process essentially reverses standard convolution: each value of the low-resolution feature map is multiplied by all the filter weights of the transposed convolution, and the resulting values are added to the output feature map.
In Deep Learning models, the term "Backbone" refers to the feature extractor network that processes input images to compute essential features. These features are then either flattened and connected to a fully connected network for classification tasks, or upsampled by a decoder module to create segmentation masks. Commonly used backbone networks in research include DarkNet, ResNet, ResNeXt, InceptionV3, XceptionNet, and EfficientNet.
Residual learning is a convolutional neural network design that applies skip connections. The primary concept of this network is to utilize shortcut connections with matching matrix sizes to bypass one or more convolution layers. This design is known as the Residual Block, illustrated in Figure 2.22.
Semantic Segmentation Dataset for Road Scene Application
Figure 2.35.The CamVid dataset contains more than 30 labels that correlate to 32 items such as automobiles, roads, sidewalks, light poles, people, and so on [23]
The CamVid dataset is a comprehensive open resource for image segmentation, featuring nearly 1,500 images taken from a camera mounted on a moving vehicle in Cambridge, England. It includes detailed labels for 32 distinct categories, such as automobiles, roads, light poles, road markings, and pavements, making it a valuable tool for researchers and developers in the field of computer vision.
Since its introduction, the CamVid dataset has been a significant resource for academics evaluating the performance of segmentation networks. The mean Intersection over Union (mIoU) scores achieved on the CamVid dataset have shown consistent improvement over time, particularly for real-time methods.
The Cityscapes dataset features a rich collection of street scene images captured by stereo cameras across more than 50 cities, encompassing 30 distinct label classes. With over 5,000 high-quality pixel-annotated frames and an additional 20,000 weakly annotated frames, this dataset is designed to support segmentation algorithms for pixel-level, instance-level, and panoptic semantic labeling. Furthermore, it serves as a valuable resource for research that requires extensive data, particularly for training deep neural networks.
Figure 2.36 Cityscapes dataset with 30 semantics label classes [24]
The dataset was collected using a camera mounted on a vehicle traveling through the HCMUTE University campus, specifically in blocks C, D, and A This data, which has a resolution of 640x360 pixels, is labeled using the Hasty.ai platform and features three distinct color-coded labels: blue for roads, red for pedestrians, and purple for cars To improve model learning, various augmentation techniques were applied, including random resizing, rotation, perspective transformation, and color jittering, each with a factor of 0.3.
The dataset comprises 1,299 PNG images with semantic labels depicting road scenes categorized into three classes: road (blue), cars (purple), and pedestrians (red) To enhance the performance of the segmentation network, data was collected during three distinct periods: morning, afternoon, and night, as illustrated in Figure 2.37, which displays the distribution of data gathered throughout the day.
The custom dataset consists of 1,299 images captured by a vehicle-mounted camera on the HCMUTE campus, with labels tagged using the Hasty.ai platform The dataset includes various time blocks: a) Afternoon data from block C, b) Morning data from block C, c) Night data from block A, and d) Afternoon data from block A.
Figure 2.38 Data augmentation methods used in the training step: random resize, random rotate, random perspective transform, and color jitter.
Camera Calibration
Camera calibration involves estimating the parameters of a camera in order to correct for distortion, which is especially important when using a wide-angle lens such as our 130° camera. By gathering this camera information, we can effectively undistort the captured images. There are two main categories of parameters: intrinsic and extrinsic. Intrinsic parameters consist of the focal length, distortion coefficients, and the lens center, while extrinsic parameters include the rotation and translation matrices that relate the camera to the world coordinates. Utilizing the extrinsic parameters allows for the conversion of world coordinates into camera coordinates. Subsequently, points from the environment can be accurately projected onto the image plane using the intrinsic parameters, as demonstrated in Equation 2.8, which illustrates the transformation of a point from 3D space to the 2D image plane.
P is a 3x4 projection matrix consisting of two parts: the intrinsic matrix (K) and the extrinsic matrix (R and t):

$$ s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}, \qquad P = K\left[\, R \mid t \,\right] \tag{2.8} $$

K is an upper triangular matrix representing the intrinsic camera parameters:

$$ K = \begin{bmatrix} f_x & \gamma & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} $$

where fx and fy denote the focal lengths along the x and y axes, respectively; the optical center of the image plane is defined by the coordinates cx and cy; and γ represents the skew between the axes, which is typically set to 0 in most camera models.
Figure 2.39 Camera calibration Left: original heavily distorted image Right: calibrated image.
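A hedged sketch of the calibration procedure using OpenCV is shown below; the checkerboard pattern size and file paths are assumptions, since the thesis does not specify the calibration target.

```python
import glob
import cv2
import numpy as np

# Checkerboard with 9x6 inner corners (assumed calibration pattern)
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for fname in glob.glob("calib/*.png"):              # hypothetical calibration images
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K is the intrinsic matrix of Equation 2.8, dist holds the distortion coefficients
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

undistorted = cv2.undistort(cv2.imread("frame.png"), K, dist)   # cf. Figure 2.39, right
```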
DESIGN AND CALCULATION
Design and Propose of the Image Segmentation Model
This thesis proposes an image segmentation model that prioritizes high accuracy, low latency, and a high frame rate. The author evaluated various models trained on well-known datasets such as CamVid and Cityscapes, identifying several that meet these criteria, including ENet, ICNet, and FasterSeg. Ultimately, BiSeNet was chosen due to its impressive mIoU metric, satisfactory FPS during the research period, and its ease of implementation and deployment to formats such as ONNX, TFLite, and TensorRT, which reduce latency.
Figure 3.1 illustrates the accuracy and speed comparison of BiSeNet against recent networks on the Cityscapes dataset. The evaluation was conducted on an NVIDIA Titan XP with an input resolution of 2048x1024. The "-" symbol denotes methods that do not report the corresponding accuracy [26].
BiSeNet, introduced by Yu et al. at ECCV 2018, is an innovative image segmentation network addressing the challenges of self-driving cars. While CNNs have excelled at pixel segmentation using backbone networks pretrained on ImageNet, such as ResNet and InceptionNet, traditional encoder-decoder architectures struggle with real-time performance due to bottleneck layers that slow down image processing. BiSeNet overcomes this limitation by splitting the feature extraction into two paths that operate in parallel, thus enhancing efficiency and speed in image segmentation tasks.
Figure 3.2 a) Classical CNN network structure b) UNet network structure c) BiSeNet network structure [26]
Based on the figure, the structure of BiSeNet is a combination of a traditional CNN network and a UNet network with two components: the Spatial Path (SP) and the Context Path (CP). As their names suggest, these two pathways were created to solve the two problems of spatial information loss and narrowing of the receptive field.
The Spatial Path (SP) is designed to preserve spatial information in deep convolutional neural networks (CNNs), addressing the loss of important spatial details that occurs when feature maps are flattened at the fully connected layer. By utilizing a small stride, the SP maintains high resolution; it consists of three convolution blocks with a 3x3 convolution kernel and a stride of 2, each followed by batch normalization and a ReLU activation layer, resulting in an output that is 1/8 the size of the original image. Functioning similarly to a UNet encoder, the Spatial Path effectively gathers detailed spatial information through large feature maps, enhancing the model's predictive capabilities.
Figure 3.3 Inspecting the feature map of Spatial Path, a rather shallow Convolutional Network, which later helps restrain high-resolution features
The Context Path (CP) is designed to enlarge the receptive field, which is crucial for improving segmentation accuracy. While traditional methods such as the Pyramid Pooling Module and Atrous Spatial Pyramid Pooling can enlarge the receptive field, they often require substantial computing power and memory, slowing down the model. In contrast, the CP leverages a lightweight model, such as Xception, along with Global Average Pooling (GAP) to efficiently expand the receptive field while maintaining speed. By rapidly down-sampling feature maps, the BiSeNet network captures extensive context information. The global context from GAP is integrated with the features of the lightweight model through a UNet-like topology, ensuring a robust and efficient segmentation process.
Figure 3.4 illustrates the feature maps of the Context Path, which consists of deep convolution layers that progressively down-sample the feature maps to 1/4, 1/8, 1/16, and 1/32 of the input size. This down-sampling enlarges the receptive field, allowing the model to capture more context, although some fine details are lost along the way.
The Attention Refinement Module (ARM) refines the features at each stage by utilizing global average pooling to capture global context and generate an attention vector that guides feature learning. This design boosts the output characteristics within the Context Path and integrates global context information without any up-sampling, resulting in reduced computational resource requirements.
The Feature Fusion Module (FFM) integrates features from the two pathways, which express information at different levels, making simple addition inadequate. The Spatial Path primarily captures detailed spatial information, while the Context Path focuses on encoding contextual information. Consequently, the Spatial Path produces low-level output features, whereas the Context Path yields high-level output. This disparity necessitates a fusion module to effectively combine these features.
To reassemble features represented at different levels, the Spatial Path and Context Path features are first concatenated, and batch normalization is applied to normalize the inputs and re-balance the feature scales. Subsequently, the combined features are pooled into a feature vector from which a weight vector is computed. This weight vector re-weights the features, thereby enhancing feature selection and combination.
Figure 3.7 Inspecting the feature map of Feature Fusion Module, features from Spatial Path and Context Path being combined to make more fine-grained output segmentation
- Global Average Pooling (GAP): a popular idea in CNN network designs. During convolutional network training, overfitting can occur when the model becomes biased towards specific examples rather than general object features, leading to inaccurate predictions. The Global Average Pooling (GAP) technique mitigates overfitting by reducing the number of parameters the network has to learn, compelling it to focus on essential features. Similar to max pooling, GAP reduces the size of a feature map, but it does so by averaging all of its component values, enhancing the model's ability to generalize.
The network employs auxiliary loss functions to supervise its learning, while the principal loss supervises the overall output of BiSeNet. Two auxiliary loss functions are used to supervise the output of the Context Path, and all loss functions are Softmax losses, as detailed in Equation 3.1. To balance the weights of the principal and auxiliary losses, a parameter α is introduced, set to one in this work. This joint loss structure simplifies the optimization of the network.
$$ \text{loss} = \frac{1}{N}\sum_{i} L_i = \frac{1}{N}\sum_{i} -\log\left(\frac{e^{p_i}}{\sum_{j} e^{p_j}}\right) \tag{3.1} $$

where p is the output of the network prediction.

The joint loss function is defined as:

$$ L(X; W) = l_p(X; W) + \alpha \sum_{i=2}^{K} l_i(X_i; W) $$

where lp represents the principal loss of the fusion output, Xi denotes the feature output from stage i of the Xception backbone, li is the auxiliary loss for stage i, and K is set to three.
L is the combination of the two loss functions. The auxiliary loss is only used during training. The training data is labeled using the Hasty.ai webpage [25].
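As an illustration only, the joint loss described above could be sketched in PyTorch as follows; the function name and the use of torch are assumptions, not the thesis implementation.

```python
import torch.nn.functional as F

def bisenet_loss(main_logits, aux_logits_list, target, alpha=1.0):
    """Principal Softmax (cross-entropy) loss on the fused output plus
    alpha-weighted auxiliary losses on the Context Path stages (training only)."""
    principal = F.cross_entropy(main_logits, target)                      # l_p(X; W)
    auxiliary = sum(F.cross_entropy(aux, target) for aux in aux_logits_list)
    return principal + alpha * auxiliary
```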
The BiSeNet network architecture, illustrated in Figure 3.8, consists of two main components: the Context Path and the Spatial Path. This overview highlights the feature maps generated at each layer, along with key parameters including the data size and the number of channels, and presents the dimensions of the input and output images, providing a comprehensive understanding of the network's structure and functionality.
Mathematical Analysis of a 4-wheeled Ground Vehicle
This section explores various vehicle models to aid in the development of the hardware and controller design. We discuss the Ackermann steering model, along with the 2D kinematic bicycle and dynamic bicycle models.
Ackermann steering kinematics is fundamental to the design of steered motor vehicles, ensuring that the wheels do not slip excessively during steering. This geometry is prevalent in most cars today, as it gives each wheel its own center of rotation. A key principle of Ackermann steering is that the inner and outer wheel axles must be aligned so that their extension lines intersect on the line extending from the rear axle, enabling all wheels to travel along concentric circles.
3.2.2 Analyze the Kinematic Model of Four-wheeled Vehicle
To simplify the model, the team uses a reduced kinematic model of the car, which is a shortened form of the Ackermann model, also known as the kinematic bicycle model.
The continuous nonlinear equations describing the model motion in an inertial frame of reference have the form [27]:

$$ \dot{x} = v\cos(\psi + \beta), \qquad \dot{y} = v\sin(\psi + \beta), \qquad \dot{\psi} = \frac{v}{l_r}\sin\beta, \qquad \dot{v} = a, \qquad \beta = \tan^{-1}\!\left(\frac{l_r}{l_f + l_r}\tan\delta_f\right) $$
In an inertial coordinate system (X, Y), the coordinates of the center of mass are represented by x and y, while ψ denotes the yaw (heading) angle and v indicates the vehicle's speed. The distances from the vehicle's center of gravity to the front and rear axles are denoted as lf and lr, respectively. The angle of the velocity of the center of mass relative to the car's longitudinal axis is represented by β, and a denotes the acceleration of the center of mass in the direction of the velocity. The control inputs are the front steering angle δf and the acceleration a; the rear wheel steering angle δr is assumed to be zero, as most vehicles do not allow rear-wheel steering.
The kinematic bicycle model simplifies controller design and navigation algorithms due to its reliance on just two parameters, lf and lr. However, this simplicity can lead to significant errors in practical applications when compared to higher-fidelity vehicle models.
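For reference, a minimal Euler-integration sketch of the standard kinematic bicycle model is given below; it assumes the equations above and is not the exact simulation code used in this thesis.

```python
import numpy as np

def kinematic_bicycle_step(x, y, psi, v, a, delta_f, lf, lr, dt):
    """One Euler-integration step of the kinematic bicycle model (Section 3.2.2)."""
    beta = np.arctan(lr / (lf + lr) * np.tan(delta_f))   # slip angle of the centre of mass
    x   += v * np.cos(psi + beta) * dt
    y   += v * np.sin(psi + beta) * dt
    psi += v / lr * np.sin(beta) * dt
    v   += a * dt
    return x, y, psi, v
```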
3.2.3 Analyze the Dynamic Model of Four-wheeled Vehicle
When analyzing vehicle dynamics, it is essential to consider factors such as acceleration, wind force, wheel slip angle, friction, and the mass of the vehicle. These elements collectively influence the performance of the car model. The motion of the vehicle in a coordinate system (X, Y) can be described by the dynamic bicycle model.
In the vehicle's body frame, x and y denote the longitudinal and lateral speeds, and ψ indicates the yaw angle. The vehicle's mass and yaw inertia are represented by m and Iz, respectively. Additionally, Fc,f and Fc,r denote the lateral tire forces acting on the front and rear wheels, expressed in the coordinate frame aligned with the wheels.
The lateral force acting on the wheels is described by the linear tire model:

$$ F_{c,i} = -C_{\alpha i}\,\alpha_i, \qquad i \in \{f, r\} $$

where αi is the slip angle and Cαi is the cornering stiffness coefficient between the wheel (front or rear) and the road.
Designing of Control System
This section describes the dual closed-loop lateral controllers implemented in our self-driving system. One controller utilizes geometric techniques to estimate the vehicle's heading from edge-detected segmented road images, while the other relies on GPS waypoints and IMU measurements. Both controllers run simultaneously within the Python program, but only one can influence the steering at any given moment, determined by a system flag that manages their activation.
3.3.1 Waypoint-based Geometric Lateral Controller
To navigate the vehicle along a designated GPS track, we consider multiple options for a 2D waypoint-based controller. Using the vehicle's state, including its position (x, y), yaw heading, and its position relative to the route trajectory, the controller computes the steering angle needed to direct the vehicle towards its desired position. This section explores several controller options, including the well-established Pure Pursuit and Stanley controllers.
Pure Pursuit is a geometry-based trajectory tracking controller designed to guide a vehicle along a predetermined path. The controller places a look-ahead point at a fixed distance along the reference path and steers the car toward that point using a calculated steering angle. As illustrated in Figure 3.11, the Pure Pursuit controller effectively manages the vehicle's trajectory.
Figure 3.11 Illustration of a pure controller model (Pure Pursuit) [27]
Based on the law of sines, Equations 3.6 describe the controller:

$$ \frac{l_d}{\sin(2\alpha)} = \frac{R}{\sin\!\left(\frac{\pi}{2} - \alpha\right)} \tag{3.6a} $$

$$ \frac{l_d}{2\sin\alpha\cos\alpha} = \frac{R}{\cos\alpha} \tag{3.6b} $$

$$ \frac{l_d}{\sin\alpha} = 2R \tag{3.6c} $$

$$ \kappa = \frac{1}{R} = \frac{2\sin\alpha}{l_d} \tag{3.6d} $$

where κ is the curvature of the trajectory, ld is the look-ahead distance, α is the angle between the vehicle's heading and the look-ahead point, and R is the turning radius. Applying the bicycle kinematic model to the above equations and replacing R using (3.6d), the steering angle δ is defined as follows:

$$ \delta = \tan^{-1}(\kappa L) = \tan^{-1}\!\left(\frac{2L\sin\alpha}{l_d}\right) $$

where L is the wheelbase of the vehicle.
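A minimal sketch of the Pure Pursuit steering computation is given below; the function signature and the use of a look-ahead waypoint in the global frame are illustrative assumptions.

```python
import numpy as np

def pure_pursuit_steering(x, y, yaw, target, wheelbase, lookahead):
    """Steering angle toward a look-ahead waypoint, following Equation 3.6d."""
    dx, dy = target[0] - x, target[1] - y
    alpha = np.arctan2(dy, dx) - yaw       # angle between heading and the look-ahead point
    return np.arctan2(2.0 * wheelbase * np.sin(alpha), lookahead)
```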
The Stanley controller, used by Stanford University's DARPA Grand Challenge team, employs a path-tracking method that references the front axle instead of the rear axle, as in the Pure Pursuit method. This algorithm accounts for both the heading error and the cross-track error, where the cross-track error is defined as the distance from the nearest point on the track to the vehicle's front axle.
Figure 3.12 Stanley, the self-driving car of Stanford University, which participated in and won the DARPA Grand Challenge [28].
Figure 3.13 Illustration of the Stanley controller model
The steering angle of the vehicle is determined geometrically by the following formula:

$$ \delta(t) = \psi(t) + \tan^{-1}\!\left(\frac{k\,e(t)}{v(t)}\right) $$

where ψ(t) is the heading error, e(t) is the cross-track error, v(t) is the vehicle speed, and k is a tuning gain. In the case of a large heading error, the heading error at time t is added to the current steering angle to quickly return the vehicle to the desired position. In the case of a large cross-track error, the expression $\tan^{-1}\!\left(\frac{k\,e(t)}{v(t)}\right)$ approaches ±π/2, and the steering angle is limited to stabilize the vehicle.
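A corresponding sketch of the Stanley control law is shown below; the gain, steering limit, and the small constant added to the speed are illustrative assumptions.

```python
import numpy as np

def stanley_steering(heading_error, cross_track_error, speed, k=1.0, max_steer=np.radians(30)):
    """Stanley control law: heading error plus arctan(k*e/v), clipped to the servo limits."""
    delta = heading_error + np.arctan2(k * cross_track_error, speed + 1e-6)
    return np.clip(delta, -max_steer, max_steer)
```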
3.3.2 Image-Segmentation-based Lateral Controller
After obtaining the segmented road from the segmentation model, the next step is to determine an offset position that accounts for the road's shape, contour, and obstacles such as cars and pedestrians. The study team introduces the "7-point distance matrix" approach to estimate the predicted offset position for the vehicle. This method generates aggregate coordinates (x, y) by identifying the intersection points of seven projection lines on each of the left and right sides with the road contour, starting from the vertical center of the frame and separated by an angle of 10 degrees. Figure 3.14 illustrates this approach.
Figure 3.14 Average offset point from segmented image; a) original image; b) projection lines intersecting with the road’s contour to compute average offset point
The offset Xoffset, or error of the vehicle, is calculated from frame to frame with Equation 3.10:

$$ X_{offset} = \frac{1}{7}\sum_{i=1}^{7}\frac{x_{l,i} + x_{r,i}}{2}, \qquad Y_{offset} = \frac{1}{7}\sum_{i=1}^{7}\frac{y_{l,i} + y_{r,i}}{2} \tag{3.10} $$

The input is a 7x2 matrix of intersection point coordinates extracted from the left and right sides of the road contour using Canny edge detection. The objective is to determine the mid-frame coordinates (Xoffset, Yoffset) used to guide the vehicle's steering.
The heading angle δ(i) toward the offset point can be computed every frame using the following equation:
Figure 3.15 Steering angle estimation from image segmentation
To minimize noise and ensure smooth steering, a PID (Proportional-Integral-Derivative) controller is employed to reduce the error between the vehicle's current angle and the desired angle. This control system consists of three components: proportional (Kp), integral (Ki), and derivative (Kd), which work together to optimize performance.
The proportional output (P) of the model is directly proportional to the difference between the current output and the desired output, denoted as e(t). When the error is large and positive, the output will also be correspondingly large and positive, scaled by the gain factor Kp. This proportional response to the error between the setpoint and the actual value can lead to overshoot during the steering process.
• I: integrates the previous error values between SP and PV over time to produce the integral term. The integral term aims to minimize any residual error remaining after the proportional control is applied by incorporating a cumulative value based on the error's history. Once the error is eliminated, the growth of the integral term ceases. Consequently, as the proportional effect decreases with a shrinking error, it is balanced by the growing contribution of the integral term.
• D: is most effective for predicting future trends in the error between SP and PV by analyzing its current rate of change, a method referred to as "anticipatory control." This approach aims to mitigate error impacts by applying a control influence derived from the rate of change of the error, where a faster rate of change results in a stronger regulating or dampening effect.
In continuous time, the overall control function is:

$$ u(t) = K_p e(t) + K_i \int_{0}^{t} e(\tau)\, d\tau + K_d \frac{de(t)}{dt} $$

In discrete time, the equation becomes:

$$ u(k) = K_p e(k) + K_i \sum_{j=0}^{k} e(j)\,\Delta t + K_d \frac{e(k) - e(k-1)}{\Delta t} $$

where:
Kp: Proportional gain
Ki: Integral gain
Kd: Derivative gain
e: error between the target input and the feedback value of the system
t: time or instantaneous time
τ: variable of integration, taking values from time 0 to the present time t
Δt: sampling period
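A minimal discrete-time PID sketch corresponding to the equations above is given below; the gains in the usage comment are illustrative, not the tuned values used on the vehicle.

```python
class PID:
    """Discrete PID controller used to smooth the steering-angle error."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error):
        self.integral += error * self.dt                       # I term: accumulated error
        derivative = (error - self.prev_error) / self.dt       # D term: rate of change
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# pid = PID(kp=0.8, ki=0.01, kd=0.1, dt=1/30)   # gains are illustrative only
# steering = pid.update(desired_angle - current_angle)
```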
From the MATLAB simulation, we find that adjusting the PID parameters Kp, Ki, and Kd changes the following characteristics of the system accordingly.
Response    Rise time    Overshoot    Settling time    ess
KP          Decrease     Increase     NT               Decrease
KI          Decrease     Increase     Increase         Eliminate

Table 3.1 Change of the system based on PID parameters (ess: steady-state error)
Rise time is the time the system requires to change from a specific low value to a specific high value of the output, typically from 10% to 90% of the final output value.
Overshoot is the phenomenon in which the signal value exceeds its setpoint.
Settling time is the time required for the output to reach and remain within a specific error band following an input stimulus.
Steady-state error (ess) is the margin of error relative to the setpoint after the system has achieved stability.
In this project, the team implemented all three PID controller parameters—Kp, Ki, and Kd—to optimize response during narrow turns and ensure stabilization The block diagram illustrating the PID controller for vehicle angle within a feedback loop is shown in Figure 3.16.
Figure 3.16 Block diagram of PID controller for vehicle angle
The steering angle of the autonomous vehicle is calculated using the PID model, which acts on the total offset between the seven pairs of intersections and the road contour. If the vehicle deviates by steering too far left or right, the error value increases significantly, as the image segmentation network may fail to detect the sidewalk and the road dominates the frame. The PID model uses this offset, weighted by the proportional (Kp), integral (Ki), and derivative (Kd) parameters, to steer the servo in a timely and safe manner.
Figure 3.17 Block diagram of PID for steering angle.
Preparing of GPS Road Map
OpenStreetMap (OSM) is a collaborative project designed to create a free and editable global geographic database, with its primary product being the geodata that underlies the maps. The initiative was launched in response to limitations on map data accessibility worldwide and the rise of affordable portable satellite navigation devices. OSM data can be applied in numerous ways, such as generating print and digital maps, geocoding addresses and place names, and planning routes.
This project utilizes OpenStreetMap (OSM) to develop a ground-truth map for GPS tracking, focusing on the road network surrounding the HCMUTE school campus. The map extends from block D, passes through block A, and concludes at block F. Figure 3.18 illustrates the process of extracting waypoints from the road map data obtained from OSM.
Figure 3.18 OpenStreetMap road map a) HCMUTE road waypoints; b) Extracted waypoints
To convert from the global frame (Lat, Lon) to a simple plane (X, Y) for easier calculation, we adopt the equirectangular projection method.
Because the GPS waypoints are sparse, adapting the waypoint controller has proven challenging. To address this issue, we opted to apply linear interpolation to densify the GPS coordinates.
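A short sketch of the equirectangular conversion and the linear interpolation of waypoints is given below; the Earth-radius constant and densification factor are assumptions for illustration.

```python
import numpy as np

EARTH_R = 6371000.0   # mean Earth radius in metres (assumed constant)

def latlon_to_xy(lat, lon, lat0, lon0):
    """Equirectangular projection of (lat, lon) to a local (x, y) plane in metres,
    centred at the reference point (lat0, lon0)."""
    x = EARTH_R * np.radians(lon - lon0) * np.cos(np.radians(lat0))
    y = EARTH_R * np.radians(lat - lat0)
    return x, y

def densify(xs, ys, factor=10):
    """Linear interpolation of sparse waypoints (e.g. 400 -> 4000 points, Figure 3.20)."""
    t = np.arange(len(xs))
    t_new = np.linspace(0, len(xs) - 1, factor * len(xs))
    return np.interp(t_new, t, xs), np.interp(t_new, t, ys)
```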
Figure 3.19 An example of Linear Interpolation method
Figure 3.20 GPS waypoints interpolated using our method: a) Original waypoints containing 400 points; b) Interpolated waypoints with 4,000 points.
Extended Kalman Filter
The Extended Kalman Filter (EKF) is a nonlinear variation of the Kalman filter that linearizes around estimates of the current mean and covariance. It is widely regarded as the standard estimation method in tracking systems such as GPS. Unlike the linear Kalman filter, the EKF allows the state transition and observation models to be differentiable nonlinear functions rather than strictly linear ones. The system dynamics estimated by the EKF evolve according to the following process equation.
$$ x_t = f(x_{t-1}, u_{t-1}, w_t) \tag{3.14} $$

where xt consists of the estimated states, ut is the system control input, and wt is the process noise. In this project, the state vector xt contains the vehicle position (x, y), yaw angle, and speed.
Equation 3.15 models the measurement process:

$$ z_t = h(x_t, v_t) \tag{3.15} $$

where zt is the measurement vector, whose shape depends on the number of measurements and variables, and vt is the measurement noise.
In navigation systems, the signals are inherently nonlinear, necessitating the use of the first-order Taylor series to approximate the nonlinear process and measurement models. The accuracy of the Extended Kalman Filter (EKF) output improves as the Taylor approximation is taken closer to the working point, as illustrated in Figure 3.21.
The Jacobian matrices A and W are the partial derivatives of the function f with respect to x and w, while H and V are the Jacobian matrices of the function h with respect to x and v. The Kalman filter operates in two main phases: prediction and correction. The standard Kalman prediction step is:

$$ \hat{x}_t^- = f(\hat{x}_{t-1}, u_{t-1}, 0), \qquad P_t^- = A_t P_{t-1} A_t^{T} + W_t Q_{t-1} W_t^{T} $$
Pt is the covariance matrix associated with the prediction of the state vector parameters, and Qt is the process noise covariance matrix. The measurement update formulae are as follows:

$$ K_t = P_t^- H_t^{T}\left(H_t P_t^- H_t^{T} + V_t R_t V_t^{T}\right)^{-1} $$
$$ \hat{x}_t = \hat{x}_t^- + K_t\left(z_t - h(\hat{x}_t^-, 0)\right) $$
$$ P_t = (I - K_t H_t)\, P_t^- $$

where Kt is the Kalman gain and Rt is the measurement noise covariance matrix.
The relationship between the state variables of the autonomous vehicle is described by the following dynamic model.
Figure 3.22 Uncertainty after 2260 timesteps derived from equation 𝑃̂ 𝑡 (3.18) with unit x, y (m), yaw angle (rad), speed (m/s)
Figure 3.23 Example of position (x, y) displacement after EKF The distance is calculated in meter unit
The experiment took place on a specific GPS road track within our school campus, utilizing a diagonal initial covariance matrix with values of 500. Key parameters included a maximum velocity of 9 m/s, a turn rate of 0.05 rad/s, and an acceleration of 3 m/s². The uncertainty derived from the covariance matrix P̂t is illustrated in Figure 3.22.
The flowchart for real time EKF is shown in figure 3.24 below
Figure 3.24 Block diagram of real time EKF.
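A generic predict/correct EKF step is sketched below for reference; the function arguments are placeholders, and the sketch folds the W and V noise Jacobians of the equations above into the Q and R matrices for simplicity.

```python
import numpy as np

def ekf_step(x, P, u, z, f, h, jac_F, jac_H, Q, R):
    """One predict/correct cycle of the Extended Kalman Filter (Equations 3.14-3.15).
    f, h are the process and measurement functions; jac_F, jac_H return their Jacobians."""
    # Prediction
    x_pred = f(x, u)
    F = jac_F(x, u)
    P_pred = F @ P @ F.T + Q

    # Correction (measurement update)
    H = jac_H(x_pred)
    y = z - h(x_pred)                                  # innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)                # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```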
Fusion Strategy between Segmentation-based Controller and GPS-waypoint Controller
Figure 3.25 The flowchart describes the fusion strategy
The image-segmentation approach effectively estimates heading positions while avoiding obstacles, making it suitable for short-term local planning. In contrast, the GPS-waypoint controller navigates routes based on predetermined waypoints but cannot account for obstacles such as vehicles and pedestrians.
The author utilizes a flag-based technique to control the vehicle, relying solely on GPS, IMU, and an RGB camera due to the limited sensors available. This method involves scanning the road's contour and assigning GPS flags at critical points such as intersections or challenging curves. Once a flag is activated, the system selects the controller best suited to the current driving conditions.
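A minimal sketch of the flag-based switching idea is shown below; the 3-meter activation radius and the controller names are illustrative assumptions, not the exact logic of the thesis program.

```python
def select_controller(vehicle_xy, flag_waypoints, radius=3.0):
    """Flag-based fusion: switch to the GPS-waypoint controller near flagged points
    (intersections, sharp curves), otherwise keep the segmentation-based controller."""
    for fx, fy in flag_waypoints:
        if (vehicle_xy[0] - fx) ** 2 + (vehicle_xy[1] - fy) ** 2 < radius ** 2:
            return "gps_waypoint"
    return "image_segmentation"

# Example: pick the steering source for the current frame
# mode = select_controller(position_xy, flags)
# steering = stanley_steering(...) if mode == "gps_waypoint" else pid.update(x_offset)
```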
HARDWARE ASSEMBLY
Overall
The 1/10-scale self-driving vehicle model is built on a 1/10-size RC car equipped with a 2K 130° camera and a Jetson TX2 embedded computer. It includes two batteries, a 3-cell LiPo 5000 mAh and a 2-cell LiPo 5500 mAh, along with a 16-channel servo driver board (PCA9685), a 60A ESC motor controller, a brushed motor (RC-540), a servo (MG996R), an RX601 receiver, and an Arduino Nano for additional control and functionality.
The redesigned model features significant enhancements, including a wider-angle camera, optimized rear wheel positioning for better balance with rear-mounted devices, and a complete rewiring of circuits to fit within the vehicle Additionally, a 3mm thick mica layer serves as an exterior protective shield The accompanying figure illustrates the control mechanisms for speed and steering in the self-driving vehicle.
Figure 4.1 Block diagram of self-driving vehicle.
Hardware
The 1/10-scale remote-control racing car uses a 2.4 GHz radio link, enabling remote operation from distances of up to 150 meters. The vehicle features steering-wheel-style controls, allowing for a driving experience akin to that of a real car, even when used autonomously.
Some information from the manufacturer about this car:
- Motor used: RC-540 with brush
- The servo motor used can withstand a torsion force of about 6kg
- Suitable battery type: Li-Po 7.2V 2200mAh
- Maximum distance that this vehicle can receive radio waves: 150 meters
- Vehicle length: 465 mm, Vehicle width: 247 mm, Vehicle height 205 mm
Figure 4.2 Autonomous RC car model
The brushed motor RC-540 used in the car is a micro motor. Information from the manufacturer about the motor is as follows:
The MG996R is an upgraded variant of the MG995 servo, featuring enhanced power and improved precision in steering angle control thanks to its new PCB and IC design.
- Torque force: 9.4kg/cm (4.8V); 11kg/cm
- The no-load operating current of the servo is 170 mA
Figure 4.5 shows the Arduino Nano V3.0 Atmega328P Board controlling self- driving car model
The Arduino Nano V3 is a compact and powerful circuit board featuring the Atmega328 microcontroller, offering similar functionalities to the Arduino Uno R3 Both boards share the same core architecture; however, the Arduino Nano V3 distinguishes itself by including two additional ADC signal read pins, A6 and A7.
- Digital I/O pins: 14 (6 pins capable of outputting PWM signals)
- Current per input/output pin: 40mA
- Flash memory: 16 kB (ATmega168) or 32 kB (ATmega328), of which 2 kB is used by the bootloader; SRAM: 1 kB (ATmega168) or 2 kB (ATmega328)
- 512 bytes of EEPROM (Atmega168) or 1 kB (Atmega328)
The camera, featuring a 2K resolution, a 130-degree wide field of view, and a frame rate of 30 fps, connects seamlessly to the Jetson board via USB This connectivity allows for easy integration with circuit boards like the Jetson Nano and Raspberry Pi, enabling high-quality photo capture for processing.
The PCA9685 servo driver allows precise control of multiple servos, overcoming the limitation of controlling only a single servo directly when steering the car. The driver interfaces with the NVIDIA Jetson TX2 via the I2C protocol, enhancing accuracy and performance. A pinout diagram of the PCA9685 is provided below.
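A minimal sketch of driving the PCA9685 from the Jetson, assuming the Adafruit ServoKit Python library, is shown below; the channel assignments are illustrative, not the wiring used in this thesis.

```python
from adafruit_servokit import ServoKit

kit = ServoKit(channels=16)              # PCA9685 exposes 16 PWM channels over I2C
kit.servo[0].angle = 90                  # centre the steering servo (channel 0 is an assumption)
kit.continuous_servo[1].throttle = 0.2   # example throttle command to the ESC channel
```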
AI is rapidly emerging as a leading technology trend, driving the creation of numerous innovative products designed to address complex challenges A prime example is the NVIDIA TX2, which, despite its compact size, excels in managing extensive tasks such as executing multiple artificial networks simultaneously This capability significantly enhances algorithm processing speed, enabling efficient object recognition, mathematical problem-solving, image segmentation, and speech recognition.
- CPU: Dual-core NVIDIA Denver 2 64-bit CPU + Quad-core ARM Cortex-A57
- Camera: 12 lanes MIPI CSI-2, D-PHY 1.2
- USB: USB 3.0 + USB 2.0 (Micro USB)
- Motherboard: 170.2 mm x 170.2 mm x 15.6 mm
Table 4.1 Specifications of Jetson TX2
Below is the NVIDIA TX2 board image
Figure 4.8 NVIDIA JETSON TX2 board
The OBC OLAX U80 4G USB dongle supports high-speed 4G connectivity of up to 150 Mbps and can broadcast Wi-Fi to up to 10 devices simultaneously. This versatile device is suitable for personal use, workgroups, or installation in small vehicles.
The RX601 receiver is used to receive signals from the DEVO7 transmitter to control the steering and throttle of the RC car.
- Compatible with controller (Tx): DEVO
Self-driving Car Model
The self-driving RC car model, developed by the students, features a servo mechanism connected to the steering wheels, allowing controlled turns that keep the vehicle along the center of the road. A stationary camera mounted on the car's roof captures road images for processing, while a phone holder enables streaming of the phone's GPS and IMU signals for navigation.
Figure 4.11 The Autonomous self-driving RC Car model
Figure 4.12 Hardware connection of the project.