Thesis: 3D Multi-Object Tracking in LiDAR Point Cloud

Structure

  • 1.1 Motivation
  • 1.2 Goals
  • 1.3 Scope
  • 1.4 Thesis structure
  • 2.1 Deep Learning
  • 2.2 Neural Networks
  • 2.3 Training Process
  • 2.4 Convolutional Neural Networks
  • 2.5 Basic metrics
    • 2.5.1 Intersection over Union
    • 2.5.2 Multiple Object Tracking Accuracy
    • 2.5.3 Average Multiple Object Tracking Accuracy
    • 2.5.4 Scaled Average Multiple Object Tracking Accuracy
    • 2.5.5 Multiple Object Tracking Precision
    • 2.5.6 Average Multiple Object Tracking Precision
    • 2.5.7 Identity Switches
    • 2.5.8 Fragmentation
  • 2.6 PointRCNN: 3D Object Detection with Point Clouds
    • 2.6.1 Overview of PointRCNN
    • 2.6.2 Point-wise Feature Extraction
    • 2.6.3 Region Proposal Network
    • 2.6.4 Proposal Refinement
    • 2.6.5 End-to-End Training and Loss Functions
    • 2.6.6 Advantages of PointRCNN
    • 2.6.7 Applications and Use Cases
    • 2.6.8 PointRCNN Output and Integration with 3D MOT
    • 2.6.9 Challenges and Future Directions
    • 2.6.10 Conclusion
  • 2.7 Particle Filter-Based Tracking
    • 2.7.1 State Representation with Particles
    • 2.7.2 Prediction Step
    • 2.7.3 Update Step
    • 2.7.4 Resampling Step
    • 2.7.5 Track Maintenance
    • 2.7.6 Advantages of Particle Filter-Based Tracking
  • 2.8 Kalman Filter-Based Tracking
    • 2.8.1 State Representation in Kalman Filter
    • 2.8.2 Track Maintenance
    • 2.8.3 Advantages of Kalman Filter-Based Tracking
  • 2.9 Comparison of Kalman Filter and Particle Filter
  • 3.1 FANTrack: 3D Multi-Object Tracking with Feature Association Network
    • 3.1.1 Dataset
    • 3.1.2 Architecture
  • 3.2 Robust Multi-Modality Multi-Object Tracking
    • 3.2.1 Dataset
    • 3.2.2 Architecture
  • 3.3 3D Multi-Object Tracking: A Baseline and New Evaluation Metrics
    • 3.3.1 Dataset
    • 3.3.2 Architecture
  • 3.4 EagerMOT: 3D Multi-Object Tracking via Sensor Fusion
    • 3.4.1 Dataset
    • 3.4.2 Architecture
  • 4.1 Discussion
  • 4.2 Architecture
  • 4.3 Data preprocessing
    • 4.3.1 Detection Data Preparation
    • 4.3.2 Handling Frame-specific Data
    • 4.3.3 Setting Up Directories and Paths
    • 4.3.4 Tracker Initialization
    • 4.3.5 Tracking and Saving Results
    • 4.3.6 Post-Processing and Result Integration
  • 4.4 Tracking Algorithms in 3D MOT
    • 4.4.1 Tracking-by-Detection
      • 4.4.1.1 Detection Output Association
      • 4.4.1.2 Data Association
      • 4.4.1.3 Track Updating
      • 4.4.1.4 Track Creation and Deletion
  • 4.5 Results
    • 4.5.1 Car Evaluation
    • 4.5.2 Cyclist Evaluation
    • 4.5.3 Pedestrian Evaluation

Contents


Motivation

In the modern world of fast-developing technologies, the ability to track and analyze the motion of several objects in 3D space has become a very important capability. The topic "3D multi-object tracking in LiDAR Point Cloud"[1] is no longer unfamiliar. It covers advanced methods and technologies intended to allow precise tracking of multiple objects using only LiDAR point clouds, which are datasets created by LiDAR (Light Detection and Ranging) sensors.

Figure 1.1: An example of tracking by LiDAR[2]

LiDAR has evolved into one of the most central and visible technologies since its introduction into 3D object tracking, as it fills many critical needs for highly resolved, spatially accurate data. LiDAR works by firing laser beams at targets and measuring how long those beams take to bounce back from objects. The data from these laser beams then contribute to creating dense 3D point clouds that represent the surrounding environment in great detail. Such point clouds are a key component in the identification and tracking of vehicles, pedestrians, and obstacles in dynamic driving scenarios. The goal of this research is to improve the quality of object tracking based on LiDAR point clouds, an important contributor to improved safety, operational efficiency, and adaptability in dynamic environments within current and future autonomous vehicle and transportation systems.

Goals

The core goal of this research is to develop and refine 3D multi-object tracking algorithms using only LiDAR point clouds. First, there is a need for efficient processing and filtering of LiDAR data to ensure its accuracy for analysis. LiDAR point cloud data often contains noise, gaps, or anomalies that need to be addressed before it can effectively be utilized for object tracking.

The other main objective is to develop object detection algorithms strong enough to accurately outline and keep track of all kinds of objects using 3D point cloud data. The main challenge here is differentiating between objects based on their shape, size, and pattern of motion.

Also important is the development of an algorithm that can track multiple objects simultaneously. Such an algorithm must address a number of challenges in tracking multiple objects that move through dynamic environments; practical examples include congested traffic and real-time decision making. The tracking system will also need to estimate not only the current state of an object but also its future movements based on past trajectories, a vital requirement for several real-time applications in self-driving cars.

The study also aims to ensure that the system performs well in real time. Autonomous vehicle and traffic management systems require ultra-low-latency responses to manage traffic efficiently on the road. Therefore, this research will focus on optimizing the computational efficiency of the tracking algorithms, including developing parallel processing techniques, to utilize high-performance computing resources for fast and accurate processing of huge volumes of LiDAR data.

The final step of the research will involve evaluating the proposed tracking algorithms on benchmark LiDAR datasets. Accuracy, robustness, and processing speed will be considered, with the aim of realizing a system that could easily be incorporated into autonomous driving. Such a system is envisioned to pay off strongly in improving vehicle safety, navigation, and decision-making within complicated environments.

Scope

This research is about 3D multi-object motion tracking in dynamic environments using LiDAR point clouds. The main part of the project involves the development of algorithms that can detect, track, and at the same time predict the movements of many objects, such as vehicles and pedestrians, using only data collected by the LiDAR sensor.

The research scope of this thesis covers the entire LiDAR data processing pipeline, starting from data collection and preprocessing. LiDAR sensors, using laser beams, can create point clouds that represent the 3D structure of the environment, but these point clouds often require significant preprocessing to remove noise, fill gaps, and improve accuracy. This research will initially focus on methods to clean and segment LiDAR point clouds, allowing the system to accurately identify individual objects in the environment.

Once the point clouds are processed, the next step is object detection and classification. The ability to identify and classify different objects based solely on their spatial characteristics is crucial for reliable tracking. To minimize complexity and build on an established baseline, vehicle detection will be performed with the PointRCNN model. The next challenge is to track multiple objects at the same time in real time. This includes linking detected objects to their respective trajectories over time, ensuring that the system can track each object's movement even when objects are occluded or temporarily lost from sight.

Tracking systems must be able to handle dynamic environments where moving objects are extremely complex, unpredictable, and interact with each other. This is especially difficult in real-world driving situations where traffic flows are complex, such as multi-vehicle intersections or turns, and where sudden changes in an object's motion can occur. Systems must not only detect and track objects but also predict their future behavior to support decision-making in autonomous systems.

Thesis structure

There are 5 chapters in this capstone project:

• Chapter 1 introduces the importance of 3D multi-object tracking using LiDAR point clouds for automated driving and transportation systems in today's digital age, briefly explains how LiDAR works, and highlights its potential to improve safety, efficiency, and real-time decision making. The research aims to develop advanced algorithms for processing, detecting, and tracking objects in dynamic environments, focusing exclusively on LiDAR data.

• Chapter 2 covers key techniques and metrics used in LiDAR-based 3D object tracking, including deep learning training pipelines, optimization algorithms, and Convolutional Neural Networks (CNNs) for processing point cloud data. It also discusses performance metrics such as IoU, MOTA, and MOTP used to evaluate tracking accuracy and consistency. In particular, this chapter presents the two main tracking algorithms, Kalman Filter-Based Tracking and Particle Filter-Based Tracking, and compares specific aspects of these two algorithms.

• Chapter 3 examines various advanced approaches to 3D multi-object tracking. To make their results comparable, these methods have one thing in common: they are all evaluated on the KITTI dataset.

• Chapter 4 proposes a new model for the problem of 3D multi-object tracking (MOT) and describes the tracking process in detail. The performance of the proposed method is evaluated on three categories (cars, cyclists, pedestrians), demonstrating competitive tracking accuracy and improved localization accuracy on a number of metrics, with some minor differences on others.

• Chapter 5 highlights the high accuracy, congestion handling, and tracking capabilities of the proposed model. The model still has several limitations, including missed detections of small or fast-moving objects and inconsistent tracking in complex situations. Future improvements aimed at raising these metrics will be the focus.

Chapter 2 presents the preliminary knowledge used in this project.

Deep Learning

Deep learning, a subcategory of machine learning, has recently emerged as one of the most influential and transformative technologies in the field of artificial intelligence. It focuses on the use of a particular kind of artificial neural network known as deep neural networks, which are composed of multiple layers capable of learning and extracting hierarchical features from raw data automatically. Deep learning algorithms perform well when solving complex problems by learning directly from a dataset. Consequently, they yield extremely good performance on challenging tasks.

Deep learning has advanced a number of fields considerably since its emergence, and it is especially impactful in the areas of computer vision and autonomous driving. Methods in this subfield are an important building block for understanding sensor data like LiDAR point clouds; tasks range from object classification and segmentation to multi-object tracking.

Neural Networks

Neural networks are one of the major building blocks of deep learning, taking inspiration from the way the human brain works. The core of a neural network consists of a series of interwoven layers of nodes, or "neurons", that convert an input into a representation, which then becomes the output. Each neuron receives some inputs, performs a computation on those inputs with the help of an activation function, and sends the resulting output to subsequent layers until the final prediction is made.

Typically, a neural network is made up of three kinds of layers: the input layer, which receives the initial data; one or more hidden layers, where intricate patterns and features are learned; and the output layer, which delivers the final predictions or classifications. The depth of neural networks can range from shallow structures with only a few layers to deep networks that contain many layers, allowing them to capture highly nonlinear relationships within the data.

In the context of 3D object tracking, neural networks play a critical role in automatically learning and extracting key features from LiDAR point clouds to facilitate object detection, classification, and tracking with remarkable accuracy. The key strength of neural networks is their ability to generalize from data; hence, they are effective in real-world applications, such as autonomous driving, where the environment is dynamic and the input data is high in volume and complexity.

Deep Neural Networks (DNNs): Deep neural networks are neural networks containing several hidden layers between the input and output layers, a fact which distinguishes them from other types of neural networks. Due to their architecture, they are able to model highly complex patterns within data. While shallow neural networks may contain just one or two hidden layers, DNNs can contain dozens, or even hundreds, of layers. That allows them to learn features in a progressively more abstract manner as data passes through the network. This depth provides DNNs with the ability to extract hierarchical representations from raw data automatically. As a result, the areas where DNNs work effectively are tasks that involve enormous volumes of unstructured data, including image and speech recognition, natural language processing, and 3D object detection.

In the domain of LiDAR-based multi-object tracking, DNNs play a crucial role in processing large, sparse point cloud data, a task where traditional approaches often fall short. These DNNs learn complex relations between 3D objects and their environment and can be used for object detection, classification, and tracking in real time in dynamic, noisy, and partially occluded environments. Their adaptability and ability to generalize from huge amounts of data make them a necessity in robust and efficient tracking systems for autonomous driving applications.

Training Process

The training step is an important task in building deep learning models, whereby the neural network is gradually fitted to the data so that it can make precise predictions or classifications. It starts with random weight initialization of the network; a process called backpropagation subsequently adjusts those weights. In more detail, during training the network is fed input data (the training data) together with the corresponding target outputs. The loss function indicates how far the predicted output is from the true label; common choices include Mean Squared Error for regression tasks and Cross-Entropy for classification tasks. Training aims to minimize this error by changing the weights of the network so that it can make better predictions. This adjustment is done through an optimization algorithm, which can be a simple one like Stochastic Gradient Descent or a more advanced one like Adam, and which iteratively adjusts the weights in the direction that reduces the loss. The network is trained in this manner over several epochs, using batches of data, which gradually allows it to learn and refine its parameters. During this time, the network also becomes more adept at discerning patterns and relationships in the data, making it capable of handling examples it has never seen. For LiDAR-based tracking of 3D objects, training lets the model learn to detect and track objects from point cloud data while adapting to the complexity and changeability of the environment.
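As a concrete illustration of the loop just described, the sketch below runs a few epochs of forward pass, loss computation, backpropagation, and weight updates. The model, data, and hyperparameters are placeholder assumptions, not the thesis's actual training setup.

```python
import torch
import torch.nn as nn

# Placeholder model and data: any classifier and labelled dataset could stand in here.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 3))
loader = [(torch.randn(32, 128), torch.randint(0, 3, (32,))) for _ in range(10)]

criterion = nn.CrossEntropyLoss()                 # loss for a classification task
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):                            # several passes (epochs) over the data
    for inputs, targets in loader:                # mini-batches of training data
        optimizer.zero_grad()                     # clear gradients from the previous step
        predictions = model(inputs)               # forward pass
        loss = criterion(predictions, targets)    # how far predictions are from the labels
        loss.backward()                           # backpropagation computes the gradients
        optimizer.step()                          # the optimizer updates the weights
```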

Optimization Algorithms: The core of any training process in deep learning is the optimization algorithm, which plays a vital role in minimizing the loss function and thereby improving the accuracy of the model. These algorithms adjust the weights and biases of a neural network to reduce the discrepancy between predicted and actual outputs. Stochastic Gradient Descent (SGD) is by far the most widely used: the weights are updated by computing the gradient of the loss with respect to the parameters, which indicates the direction in which to modify the weights so as to minimize the loss. However, vanilla SGD usually converges slowly and can be inefficient at escaping complex, high-dimensional loss landscapes; it often faces serious problems such as slow convergence or getting stuck in a local minimum.

In this regard, several advanced optimization algorithms have been developed. Momentum introduces a velocity term that helps the algorithm maintain its direction and speed, leading to better convergence. Probably the most popular optimizer is Adam, which combines the merits of both momentum and adaptive learning rates. Adam computes an adaptive learning rate for each parameter based on the first and second moments of the gradients; this allows it to converge much faster and be more efficient, especially when data is sparse or noisy. In LiDAR-based 3D object tracking, where datasets are big and the learning task is complicated, optimization algorithms such as Adam prove very useful; they allow for faster and more stable training while helping the models generalize better to unseen data.
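To make the Adam update concrete, the single-step sketch below spells out the first and second moment estimates and the per-parameter step; the hyperparameter values are the commonly used defaults, not values taken from the thesis.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of parameters w given their gradient."""
    m = beta1 * m + (1 - beta1) * grad             # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2        # second moment (running mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)                   # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                   # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)    # per-parameter adaptive step
    return w, m, v

# Usage: keep m and v as running state across steps t = 1, 2, ...
w, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
grad = np.array([0.1, -0.2, 0.05, 0.0])
w, m, v = adam_step(w, grad, m, v, t=1)
```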

Convolutional Neural Networks

Convolutional Neural Networks are a special kind of neural network designed to be exceptionally good at processing data that has a grid-like topology, for example images or 3D point clouds. Drawing inspiration from the way the human brain processes visual information, CNNs automatically and adaptively learn spatial hierarchies of features by applying convolutional layers. In a convolutional neural network, each convolutional layer uses a set of learnable filters, or kernels, which scan through the input data and perform a convolution to detect low-level features such as edges, textures, and other patterns. These features, in turn, are combined in deeper layers, forming more abstract representations, such as object parts and even whole objects. Because they greatly reduce the amount of feature preprocessing, CNNs are very effective in applications such as image classification, object detection, and segmentation, learning complex patterns directly from raw data. In the context of LiDAR-based 3D object tracking, CNNs are able to process point clouds either using 3D convolutions or by projecting the point cloud data into voxel grids or depth maps, which enables detecting and tracking objects in three-dimensional space. Due to their ability to capture spatial relations in the data in a computationally efficient way, they have found their place in both traditional image-based applications and more complex tasks such as autonomous driving and robot perception.

Figure 2.3: Convolutional Neural Networks architecture[5]

Convolutional Layers: The building blocks of CNNs are the convolutional layers, which are designed to work with spatial data. Unlike a fully connected layer, where each neuron is connected to every neuron of the previous layer, a convolutional layer uses a set of learnable filters (also called kernels) that slide over the input data performing convolutions. These filters detect local patterns (edges, textures, or corners) by performing a mathematical operation between the filters and local regions of the input data.

This produces a feature map that signals the presence of those features across different spatial locations. Convolutional layers also make the network invariant to small translations of objects; that is, the network can recognize features regardless of their position in space. Besides the standard 2D convolutions used for image data, 3D convolutions are commonly used in tasks with 3D data, such as point clouds or volumetric images, to capture spatial relationships across multiple dimensions. Key advantages of convolutional layers include parameter sharing, which reduces the overall number of parameters in the network and brings a regularizing effect that helps avoid overfitting, and hierarchical feature representation, i.e. the ability to learn complex patterns and details at higher levels of abstraction. These properties make CNNs very effective in tasks such as object detection and tracking, where spatial relationships between objects are essential.
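A minimal sketch of these ideas applied to a voxelized LiDAR input is shown below; the layer sizes and the three-class head are illustrative assumptions, not the architecture used later in the thesis.

```python
import torch
import torch.nn as nn

# Toy 3D CNN over a voxel grid: two convolutional layers and a classification head.
# Input: a batch of single-channel 32x32x32 occupancy grids.
model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1),   # learnable 3x3x3 filters detect local patterns
    nn.ReLU(),
    nn.MaxPool3d(2),                              # downsample the spatial dimensions
    nn.Conv3d(16, 32, kernel_size=3, padding=1),  # deeper layer builds more abstract features
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),                      # global pooling over the remaining grid
    nn.Flatten(),
    nn.Linear(32, 3),                             # e.g. car / pedestrian / cyclist scores
)

voxels = torch.zeros(4, 1, 32, 32, 32)            # a batch of 4 empty voxel grids
scores = model(voxels)                            # shape: (4, 3)
```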

Basic metrics

Intersection over Union

Intersection over Union[6] (IoU) is a widely used metric for quantifying the spatial overlap between predicted and ground truth bounding boxes in object detection and tracking tasks. It is calculated as the ratio of the area of intersection between the two bounding boxes to the area of their union, as shown in the following formula:

IoU = Area of Intersection / Area of Union

The IoU score ranges from 0 to 1, where a value of 0 indicates no overlap and a value of 1 indicates perfect overlap. Higher IoU values signify more accurate localization and improved tracking performance, while lower IoU values can indicate poor localization or ineffective tracking. In many object tracking applications, an IoU threshold is applied to determine whether a detection is considered correct; common thresholds include values such as 0.5 or 0.7, depending on the specific requirements of the application.
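For axis-aligned 2D boxes the computation is direct, as in the sketch below; the 3D IoU used elsewhere in this thesis additionally intersects the boxes along the height axis and accounts for rotation.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # Union = sum of areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143
```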

Multiple Object Tracking Accuracy

Multiple Object Tracking Accuracy (MOTA) is a comprehensive metric used to evaluate the overall performance of a multi-object tracking system. It combines various aspects of tracking, including detection errors and identity mismatches, to provide a single score that reflects the accuracy and consistency of the tracking algorithm. MOTA is commonly used alongside other metrics to give a more complete picture of the system's effectiveness.

The formula for MOTA is:

MOTA = 1 − (FP + FN + IDS) / num_gt

• False Positives (FP): The number of incorrect object detections (i.e., objects detected that do not exist in the ground truth).

• False Negatives (FN): The number of missed detections (i.e., objects that exist in the ground truth but were not tracked).

• Identity Switches (IDS): The number of times an object is assigned a dif- ferent identity during tracking.

• num_gt is the total number of ground truth objects across all frames.

Interpretation:

• Higher MOTA values signify better overall tracking accuracy, with fewer false positives, false negatives, and identity switches.

• MOTA provides insight into both the localization performance of the tracker and its ability to maintain consistent object identities over time.
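Reading the formula off directly, MOTA is a one-line computation once the error counts have been accumulated over a sequence; the counts below are hypothetical.

```python
def mota(fp, fn, ids, num_gt):
    """Multiple Object Tracking Accuracy from accumulated error counts."""
    return 1.0 - (fp + fn + ids) / num_gt

# Hypothetical totals accumulated over a whole sequence
print(mota(fp=120, fn=300, ids=15, num_gt=5000))  # 0.913
```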

Average Multiple Object Tracking Accuracy

Average Multiple Object Tracking Accuracy (AMOTA) is an evaluation metric used in multi-object tracking (MOT) to provide a comprehensive assessment of tracking performance across multiple objects. By averaging MOTA (Multiple Object Tracking Accuracy) over a range of recall values, AMOTA gives a more holistic view of how well the algorithm tracks objects over time, accounting for both precision and recall.

The formula for AMOTA is:

AMOTA = (1/L) · Σ_{r ∈ {1/L, 2/L, ..., 1}} ( 1 − (FP_r + FN_r + IDS_r) / num_gt )

• L is the number of recall values (confidence thresholds used for integration).

• FP_r, FN_r, and IDS_r are the number of false positives, false negatives, and identity switches at a specific recall value r.

• num_gt is the number of ground truth objects across all frames.

AMOTA thus reflects how well a tracking algorithm performs in terms of minimizing false positives, false negatives, and identity switches while maintaining high accuracy across a variety of recall thresholds.

Scaled Average Multiple Object Tracking Accuracy

Scaled Average Multiple Object Tracking Accuracy (sAMOTA) is a performance metric used in multi-object tracking (MOT) tasks. It is a scaled version of the Area under MOTA over Recall (AMOTA) metric. The scaling ensures that sAMOTA values range from 0% to 100%, making the metric more interpretable. The formula for sAMOTA is expressed as:

sAMOTA = (1/L) · Σ_{r ∈ {1/L, 2/L, ..., 1}} sMOTA_r

• L is the number of recall values (confidence thresholds used for integration).

• sMOTA_r is the scaled MOTA at a specific recall value r.

This metric is particularly useful in evaluating tracking systems by accounting for both precision and recall across different confidence thresholds.

Multiple Object Tracking Precision

Multiple Object Tracking Precision (MOTP) is a metric used to evaluate the accuracy of object localization in multi-object tracking scenarios. It measures how closely the predicted locations of tracked objects align with their corresponding ground truth positions, providing insight into the precision of the tracking system in positioning objects.

The formula for MOTP is:

MOTP = (Sum of localization precisions over all matched objects) / (Number of correctly localized objects)

Where:

• Localization Error: The distance (or error) between the predicted location (bounding box) of an object and its corresponding ground truth location, often calculated as the center-point distance or using Intersection over Union (IoU) for spatial accuracy.

• Correctly Tracked Objects: The number of objects that are correctly matched between the predictions and ground truth.

• Higher MOTP values indicate greater precision in object localization, meaning the predicted locations closely match the ground truth.

• MOTP is particularly sensitive to the accuracy of predicted bounding boxes, focusing on how well the model positions objects in relation to the ground truth.

Average Multiple Object Tracking Precision

Average Multiple Object Tracking Precision (AMOTP) is an evaluation metric used in multi-object tracking (MOT) to assess the precision of the predicted object locations in relation to their ground truth positions. It provides an average measure of the localization error across all tracked objects and frames, offering insight into how accurately the system predicts the positions of the objects. AMOTP focuses on the spatial accuracy of the tracking, as opposed to metrics like MOTA, which measure tracking consistency over time.

The formula for AMOTP is:

AMOTP = (1/L) · Σ_{r ∈ {1/L, 2/L, ..., 1}} ( Σ_i Distance_i,r / num_gt )

• L is the number of recall values (representing different confidence thresholds used for integration).

• Distance_i,r is the distance between the predicted and ground truth bounding boxes for object i at recall value r, usually measured using Intersection over Union (IoU) or another spatial metric.

• num_gt is the total number of correctly matched objects across all frames.

Identity Switches

Identity Switches (IDS) refer to occurrences where a tracking algorithm mistakenly assigns the wrong identity to a tracked object, resulting in the object being associated with a different identity during the tracking process. This error typically occurs when the algorithm fails to maintain consistent identity assignment across frames.

The calculation of IDS is straightforward:

IDS = Number of times an object is incorrectly assigned a new identity

• Lower IDS values reflect better tracking performance, as the algorithm successfully maintains the correct identities of objects across frames.

• High IDS values indicate challenges in identity preservation, suggesting that the algorithm struggles to consistently track the same object without confusion.

Fragmentation

Fragmentation (FRAG) is a metric used to quantify the degree to which an object's tracking trajectory is broken into smaller, disjointed segments. It measures how often the tracking system loses track of an object and then reacquires it, leading to interruptions in the object's continuous trajectory.

The formula for calculating FRAG is:

FRAG = Total number of fragments

A fragment is defined as a part of a trajectory where the object is correctly tracked between two consecutive frames, and a break occurs when the object is not tracked in one or more frames.

• Lower FRAG values indicate smoother, more cohesive trajectories where the objects are continuously tracked without interruption.

• Higher FRAG values suggest that the tracking algorithm frequently loses track of objects, resulting in fragmented and less smooth trajectories.

PointRCNN: 3D Object Detection with Point Clouds

Overview of PointRCNN

PointRCNN is a two-stage network designed for 3D object detection. In the first stage, it generates proposals for object bounding boxes, and in the second stage, it refines these proposals by considering the 3D context of the points. The key innovation of PointRCNN lies in its ability to operate directly on raw 3D point clouds. Traditional methods, including many that rely on 2D projections, struggle to capture the full 3D spatial relationships and fine-grained details of the environment. PointRCNN overcomes this limitation by directly processing the 3D point cloud, achieving state-of-the-art results in terms of accuracy and computational efficiency.

The core architecture of PointRCNN consists of two major modules:

• Point-wise feature extraction: This module processes the raw point cloud to extract features at the individual point level.

• Region Proposal Network (RPN): After extracting point-wise features, the network generates object proposals by sampling the point cloud and pre- dicting candidate bounding boxes.

Point-wise Feature Extraction

The first step in the PointRCNN pipeline is to extract useful features from the 3D point cloud. Point clouds are typically sparse, unordered, and unstructured, which poses a challenge for traditional deep learning models that rely on structured grid data like images. To address this, PointRCNN uses a PointNet-based architecture for point-wise feature extraction.

PointNet [1] is a pioneering model that can operate directly on raw point clouds by using symmetric functions such as max-pooling to aggregate information from all points, making the model invariant to point order. PointRCNN leverages this idea by employing a PointNet++-based approach, which adapts PointNet to process locally grouped points, capturing finer details of local point cloud structures. In this step, each point in the cloud is assigned a feature vector that encodes both its individual spatial position and its local context within the point cloud.

This feature extraction process allows PointRCNN to build a rich, per-point feature representation, which is crucial for detecting and localizing objects accurately in 3D space. By learning high-level features from the raw point cloud, PointRCNN can represent complex object structures, even in the presence of occlusions or partial scans.
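The order-invariance idea behind PointNet can be sketched in a few lines: a shared per-point MLP followed by a symmetric max-pooling aggregation. This is only an illustration of the principle, not PointRCNN's actual backbone.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + max-pool: the output is invariant to point order."""
    def __init__(self, in_dim=3, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                 nn.Linear(32, feat_dim))

    def forward(self, points):                    # points: (batch, num_points, 3)
        per_point = self.mlp(points)              # the same weights are applied to every point
        global_feat, _ = per_point.max(dim=1)     # symmetric aggregation (max-pooling)
        return global_feat                        # (batch, feat_dim)

cloud = torch.randn(2, 1024, 3)                   # two clouds of 1024 points each
print(TinyPointNet()(cloud).shape)                # torch.Size([2, 64])
```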

Region Proposal Network

The second stage of PointRCNN involves generating object proposals from the extracted point-wise features. This step is implemented using a Region Proposal Network (RPN), a concept borrowed from the 2D object detection literature (e.g., Faster R-CNN). RPNs work by sliding a small network over the feature map, predicting whether each sliding window contains an object and, if so, the coordinates of a potential bounding box.

In PointRCNN, the RPN is designed to work directly with the 3D point cloud and the feature maps generated in the first stage. Instead of relying on a fixed grid as in 2D object detection, PointRCNN's RPN operates on the point cloud's inherent 3D geometry. The RPN samples points within the cloud, clusters them into potential object proposals, and predicts the 3D bounding boxes corresponding to these objects. These proposals are further refined in the subsequent stages of the model.

The RPN produces bounding box proposals that are used to localize objects within the point cloud. It predicts the center and size of the bounding boxes, as well as the orientation and class of the objects. The output of the RPN is a set of object proposals, which are then refined by the next stages of PointRCNN.

Proposal Refinement

Once the RPN generates the object proposals, the next step is to refine these bounding box candidates to improve accuracy. In this stage, PointRCNN uses a second-stage refinement network, which focuses on improving the localization and classification of the objects detected in the first stage.

• Further classification: The second stage re-assesses each proposed object’s class (e.g., car, pedestrian) based on the points within the bounding box, refining the class predictions from the RPN stage.

• Bounding box regression: PointRCNN refines the center coordinates, orientation, and dimensions of each bounding box. This ensures that the final object localization is as accurate as possible, taking into account the 3D context of the scene and the shape of the objects.

The refinement network operates on a combination of features extracted from the region proposals and the original point cloud, leveraging the spatial relationships between the points to improve the detection results. This yields a more precise localization of the detected objects and ensures that the final bounding boxes are more tightly aligned with the objects in the scene.

End-to-End Training and Loss Functions

PointRCNN is trained end-to-end, meaning that all components of the network are optimized simultaneously during training. The model uses a multi-task loss function that combines classification and regression losses to train the object detection system.

The primary components of the loss function are:

• Classification loss: This loss measures the accuracy of the predicted object class for each proposal. It typically uses a cross-entropy loss for class prediction.

• Localization loss: This loss measures the accuracy of the bounding box predictions, including the center coordinates, size, and orientation. It uses a smooth L1 loss to penalize the difference between predicted and ground truth bounding boxes.

The model is trained using a hard negative mining strategy, which prioritizes learning from the most difficult negative examples (i.e., background proposals that the model confidently misclassifies as objects). This ensures that the model learns to correctly distinguish between objects and non-objects, improving overall detection accuracy.
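A hedged sketch of such a multi-task loss is given below in PyTorch terms; the weighting factor, tensor shapes, and 7-parameter box encoding are illustrative assumptions rather than PointRCNN's exact formulation.

```python
import torch
import torch.nn.functional as F

def detection_loss(class_logits, class_targets, box_preds, box_targets,
                   positive_mask, box_weight=1.0):
    """Cross-entropy classification loss + smooth L1 regression loss on positive proposals."""
    cls_loss = F.cross_entropy(class_logits, class_targets)     # class prediction accuracy
    if positive_mask.any():                                      # regress only matched proposals
        reg_loss = F.smooth_l1_loss(box_preds[positive_mask],
                                    box_targets[positive_mask])
    else:
        reg_loss = box_preds.new_zeros(())
    return cls_loss + box_weight * reg_loss

# Hypothetical batch: 8 proposals, 4 classes, boxes as (x, y, z, w, h, l, yaw)
logits = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
boxes, gt_boxes = torch.randn(8, 7), torch.randn(8, 7)
positives = labels > 0                                           # assume class 0 is background
print(detection_loss(logits, labels, boxes, gt_boxes, positives))
```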

Advantages of PointRCNN

PointRCNN offers several key advantages over traditional 3D object detection methods:

• Direct use of raw point clouds: Unlike methods that convert point clouds into voxel grids or 2D projections, PointRCNN works directly with the original point cloud data, avoiding information loss during preprocessing.

• High precision and recall: By generating object proposals and refining them in two stages, PointRCNN achieves a high detection accuracy, particularly for challenging scenarios like occlusions and partial scans.

• End-to-end optimization: PointRCNN’s end-to-end training allows the model to learn both local and global features, ensuring that both object detection and localization are optimized jointly.

Applications and Use Cases

PointRCNN is widely used in various autonomous driving applications, including:

• Object detection in autonomous vehicles: PointRCNN enables precise localization of objects like cars, pedestrians, and cyclists using LiDAR point clouds, which is crucial for navigation and collision avoidance.

• Robotic mapping and navigation: Robots equipped with LiDAR sensors can use PointRCNN to detect and avoid obstacles, as well as to build 3D maps of their environment.

• Urban planning and monitoring: PointRCNN can be used in applications involving urban modeling, where detailed 3D object detection is necessary for building digital twins of cities.

PointRCNN Output and Integration with 3D MOT

After running the PointRCNN model, the primary output consists of the 3D bounding boxes and associated metadata for each detected object in the point cloud. These outputs are crucial for the subsequent tracking process in 3D Multiple Object Tracking (3D MOT). The key elements of the PointRCNN output include:

• 3D Bounding Boxes: For each object detected in the scene, PointRCNN generates a 3D bounding box that encapsulates the object. This bounding box is represented by several parameters:

– Center Coordinates: The location of the center of the bounding box in 3D space (x, y, z).

– Dimensions: The width, height, and depth of the bounding box.

– Orientation: The orientation of the bounding box in the 3D space, often represented by a rotation matrix or quaternion.

• Object Class Labels: PointRCNN classifies the detected objects into categories such as "car," "pedestrian," "cyclist," etc.

• Confidence Scores: PointRCNN provides a confidence score for each detection, which indicates the model's certainty about the classification of the object.
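In code, one convenient way to hand this output to the tracking stage is a small record per detection, as sketched below; the field names are illustrative, not PointRCNN's actual output format.

```python
from dataclasses import dataclass

@dataclass
class Detection3D:
    """One detected object as passed from the detector to the tracker."""
    center: tuple        # (x, y, z) center of the 3D bounding box
    dimensions: tuple    # (width, height, depth)
    yaw: float           # orientation around the vertical axis, in radians
    label: str           # e.g. "car", "pedestrian", "cyclist"
    score: float         # detector confidence in [0, 1]

frame_detections = [
    Detection3D(center=(12.3, -1.5, 0.9), dimensions=(1.8, 1.6, 4.2),
                yaw=0.05, label="car", score=0.92),
    Detection3D(center=(4.7, 2.1, 0.8), dimensions=(0.6, 1.7, 0.7),
                yaw=1.40, label="pedestrian", score=0.81),
]
```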

Challenges and Future Directions

Despite its impressive performance, PointRCNN faces several challenges:

• Computational complexity: Working directly with point clouds, especially in high-density scenarios, can be computationally expensive. Optimizing the network for real-time applications remains a significant challenge.

• Occlusion handling: While PointRCNN performs well in many scenarios, occlusions (where objects are partially or fully blocked by other objects) can still lead to missed detections, especially in crowded environments.

Future work on PointRCNN may focus on improving computational efficiency, handling occlusions better, and integrating the model with other sensor modalities (e.g., cameras) for more robust multimodal object detection.

Conclusion

PointRCNN is a powerful, state-of-the-art model for 3D object detection that works directly with raw LiDAR point clouds. By leveraging advanced deep learning techniques such as PointNet++ and Region Proposal Networks, PointRCNN achieves high accuracy in detecting and localizing objects in complex 3D environments. Its end-to-end trainable architecture allows for precise detection and localization of objects, making it an ideal solution for applications in autonomous driving, robotics, and urban planning.

Particle Filter-Based Tracking

State Representation with Particles

In Particle Filter tracking, the state of each tracked object is represented by a set of particles. Each particle corresponds to a possible hypothesis about the object's state. The state of an object can be described by multiple parameters, such as position, velocity, and orientation (for 3D MOT, these could be the 3D coordinates, velocity in 3D space, and orientation of the object).

• Particles are typically randomly initialized or sampled based on a prior belief about the object's state, such as the object's previous state or a rough estimate of where the object might be located in the next frame.

• Each particle represents a hypothesis of the object's current state (location, velocity, etc.) and is associated with a weight that indicates how likely it is to be the correct state based on the current observation.

Prediction Step

In the prediction step, the Particle Filter propagates each particle according to a motion model, which defines how the object is expected to move over time. This model can incorporate a range of factors, including velocity, acceleration, or even complex motion patterns (such as turning or stopping abruptly). The prediction step involves applying the motion model to each particle, which updates the state of the particles based on the last known information.

The control inputs could represent factors like velocity, acceleration, or steering angle, depending on the object being tracked. For instance, for a car in a 3D MOT task, the control inputs could be the steering angle and velocity, predicting the car's new position based on these inputs.

Each particle’s state is updated, and as a result, the particles are spread across possible locations in the state space, creating a set of hypotheses about where the object might be in the next frame.

Update Step

After the prediction step, the Particle Filter compares the predicted state of each particle with the new observation (detection from PointRCNN). This comparison is done by calculating a likelihood function, which measures how well each particle's predicted state matches the actual measurement (detection) in the current frame.

• The likelihood is computed based on how close the predicted state (position, velocity, etc.) is to the observed detection. The observation can be a 3D bounding box or other measurements (e.g., object class, orientation, etc.).

• The likelihood function assigns a weight to each particle. If a particle's predicted state closely matches the actual detection, it receives a high weight. Conversely, if the particle's predicted state is far from the actual detection, its weight is lower.

Resampling Step

The resampling step ensures that particles with higher weights (i.e., those that better match the observation) are more likely to be selected in the next prediction cycle. This step helps eliminate particles that are less likely to represent the correct state, ensuring that the particle set becomes more concentrated around the true state of the object.

• Resampling works by sampling particles based on their weights. Particles with higher weights (higher likelihood of being correct) are selected more often, while particles with lower weights are less likely to be chosen.

• This process reduces the variance in the particle set and focuses the computational resources on the most likely hypotheses.

In practice, resampling is performed using algorithms like Systematic Resampling or Multinomial Resampling, which are designed to reduce sample depletion and avoid an excessive concentration of particles in a small region of the state space.
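Putting the prediction, update, and resampling steps together for a single object gives a filter loop like the sketch below (constant-velocity motion, Gaussian likelihood on position, systematic resampling); all noise levels and the particle count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500                                            # number of particles

# State per particle: [x, y, z, vx, vy, vz]
particles = rng.normal(0.0, 1.0, size=(N, 6))
weights = np.full(N, 1.0 / N)

def predict(particles, dt=0.1, process_noise=0.05):
    """Propagate each particle with a constant-velocity motion model plus noise."""
    particles[:, :3] += particles[:, 3:] * dt
    return particles + rng.normal(0.0, process_noise, particles.shape)

def update(particles, weights, measurement, meas_noise=0.5):
    """Reweight particles by a Gaussian likelihood of the observed 3D position."""
    d2 = np.sum((particles[:, :3] - measurement) ** 2, axis=1)
    weights = weights * np.exp(-0.5 * d2 / meas_noise ** 2)
    weights += 1e-300                              # avoid an all-zero weight vector
    return weights / weights.sum()

def resample(particles, weights):
    """Systematic resampling: keep particles in proportion to their weights."""
    positions = (rng.random() + np.arange(N)) / N
    idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), N - 1)
    return particles[idx].copy(), np.full(N, 1.0 / N)

# One tracking cycle for a detection observed at (1, 0, 0)
particles = predict(particles)
weights = update(particles, weights, np.array([1.0, 0.0, 0.0]))
particles, weights = resample(particles, weights)
print(np.average(particles[:, :3], axis=0, weights=weights))   # state estimate
```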

Track Maintenance

Once the resampling step is complete, the Particle Filter produces an updated estimate of the object's state. The tracker can now use this updated estimate for further processing, such as associating the object with a new detection in the next frame or updating the 3D bounding box of the object.

• Track creation occurs when a detection is not associated with any existing track, resulting in a new particle filter being initialized for that object.

• Track deletion occurs if the object is not detected for a number of frames, meaning its particles will have a very low likelihood and eventually be discarded.

In particle filtering-based tracking, the tracking system will maintain multiple particles for each object across frames, continuously refining the state estimate and improving the accuracy of the object’s position, velocity, and orientation.

Advantages of Particle Filter-Based Tracking

• Nonlinear Motion Handling: Particle Filters are well-suited for tracking objects that exhibit nonlinear or unpredictable movement patterns, such as objects that turn sharply, accelerate rapidly, or stop suddenly.

• Multi-modal Distributions: Since Particle Filters represent a set of hypotheses (particles), they can handle situations where the object might be in multiple places at once (e.g., due to occlusion or sensor noise), a property that is hard to achieve with Kalman Filters.

• Flexibility: Particle Filters do not require strong assumptions about the system's dynamics, which makes them highly adaptable for various applications, including autonomous driving and robotics.

Kalman Filter-Based Tracking

State Representation in Kalman Filter

In Kalman Filter tracking, the state of each object is represented by a state vector and a covariance matrix:

• State Vector: Describes the object's parameters, such as position, velocity, and possibly acceleration in 3D space. For example, in 3D MOT, the state vector might include the 3D position (x, y, z), velocity components (v_x, v_y, v_z), and acceleration components (a_x, a_y, a_z).

• Covariance Matrix: Represents the uncertainty in the state vector, quantifying the confidence in each state parameter.

The Kalman Filter operates under the assumption that the motion of the object can be described by a linear system and that measurement noise follows a Gaussian distribution.
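A compact sketch of such a constant-velocity Kalman filter for a 3D detection center is shown below; the state holds position and velocity only, and the noise matrices are illustrative assumptions rather than tuned values from the thesis.

```python
import numpy as np

dt = 0.1
# State: [x, y, z, vx, vy, vz]; measurement: [x, y, z]
F = np.eye(6); F[:3, 3:] = dt * np.eye(3)          # constant-velocity state transition
H = np.hstack([np.eye(3), np.zeros((3, 3))])       # only the position is observed
Q = 0.01 * np.eye(6)                               # process noise covariance
R = 0.25 * np.eye(3)                               # measurement noise covariance

x = np.zeros(6)                                    # initial state vector
P = np.eye(6)                                      # initial state covariance

def kf_step(x, P, z):
    """One predict + update cycle given a measurement z (a 3D detection center)."""
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    y = z - H @ x                                  # innovation
    S = H @ P @ H.T + R                            # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x = x + K @ y
    P = (np.eye(6) - K @ H) @ P
    return x, P

x, P = kf_step(x, P, np.array([1.0, 0.0, 0.0]))
print(x[:3])                                       # updated position estimate
```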

Track Maintenance

In Kalman Filter-based tracking, tracks are created, updated, and deleted based on the following principles:

• Track Creation: When a detection does not match any existing track, a new Kalman Filter is initialized for the detected object.

• Track Update: When a detection is associated with an existing track, the Kalman Filter updates the state estimate using the new observation.

• Track Deletion: If an object is not detected for several consecutive frames, the associated track is deleted to reduce false positives.

Advantages of Kalman Filter-Based Tracking

• Computational Efficiency: The Kalman Filter is computationally lightweight, making it suitable for real-time applications like 3D MOT in autonomous driving and robotics.

• Optimal for Linear Systems: When the motion dynamics are approximately linear, the Kalman Filter provides highly accurate state estimates.

• Noise Handling: The Kalman Filter effectively handles Gaussian noise, smoothing noisy detections for improved tracking stability.

• Wide Applicability: Despite its linear and Gaussian assumptions, the Kalman Filter is widely used in applications ranging from vehicle tracking to object detection in cluttered environments.

Comparison of Kalman Filter and Particle Filter

In the context of 3D Multiple Object Tracking using the KITTI dataset, both the Kalman Filter and Particle Filter have unique strengths and limitations. The Kalman Filter assumes that system dynamics are linear and that noise in the system and measurements follows a Gaussian distribution, making it ideal for tracking objects with predictable, smooth motions where a linear motion model suffices. However, this assumption makes it less effective in handling nonlinear motions. Although extensions like the Extended Kalman Filter and Unscented Kalman Filter attempt to address nonlinear dynamics, they add complexity and still fall short in highly nonlinear scenarios. In contrast, the Particle Filter does not rely on such restrictive assumptions, allowing it to handle complex and nonlinear dynamics effectively. This makes it particularly suitable for tracking objects in the KITTI dataset, which often exhibit irregular or highly dynamic motions, such as abrupt stops, sharp turns, or high accelerations.

When it comes to noise and uncertainty, the Kalman Filter assumes Gaussian noise and provides a single deterministic estimate of an object's state. While this approach works well in controlled scenarios, it limits the filter's ability to model systems with non-Gaussian uncertainties or multi-modal distributions, which are common in real-world datasets like KITTI due to occlusions and ambiguous detections. On the other hand, the Particle Filter is inherently better equipped to handle non-Gaussian noise and multi-modal distributions. By representing the posterior distribution with a set of particles, it can account for uncertainty and ambiguity in challenging tracking scenarios.

Another critical difference lies in computational complexity. The Kalman Filter is computationally efficient, making it suitable for real-time applications and resource-constrained systems. Conversely, the Particle Filter is computationally intensive, as it requires maintaining and updating a large number of particles. The computational cost of the Particle Filter scales with both the number of particles and the complexity of the state space, which can be demanding when working with high-resolution 3D data like KITTI.

In terms of robustness to occlusions and ambiguities, the Kalman Filter performs well in scenarios with minimal occlusions or ambiguities. However, it struggles to maintain accurate tracking when the motion deviates significantly from the assumed linear model or when observations become sparse. The Particle Filter, by maintaining multiple hypotheses about the object's state, is more robust in handling these challenges. This ability to recover from temporary losses or poorly detected objects makes it highly effective for KITTI's real-world scenarios, which frequently involve occlusions and cluttered environments.

Finally, flexibility and adaptability are notable distinctions between the two methods. The Kalman Filter's reliance on specific motion and noise models makes it relatively rigid, requiring significant modifications or switching between models to adapt to complex scenarios. In contrast, the Particle Filter's design is highly flexible and adaptable, making it suitable for a broad range of tracking tasks, including autonomous driving and robotics. For the KITTI dataset, which features diverse objects and motion patterns, the Particle Filter offers superior adaptability and tracking performance.

This chapter explores advancements in multi-object tracking (MOT) systems, emphasizing state-of-the-art techniques that integrate feature association networks and multi-modality approaches. These innovations address challenges in dynamic and complex environments, particularly in autonomous driving scenarios.

FANTrack: 3D Multi-Object Tracking with Feature Association Network

Dataset

The KITTI Tracking benchmark dataset serves as the backbone for training and evaluating FANTrack[7]. This dataset is widely recognized in the autonomous driving domain for its diverse and challenging scenarios, including occlusions, clutter, varying illumination, and dynamic environments. The KITTI dataset consists of 21 training sequences and 29 test sequences, offering a robust platform for evaluating object tracking methodologies. Each sequence provides ground truth annotations for object bounding boxes, motion trajectories, and class labels, ensuring comprehensive evaluation.

For training SimNet[7], the dataset was augmented with a mix of positive and negative examples, derived by applying geometric transformations such as translation, rotation, and scaling to ground truth bounding boxes. This augmentation strategy addresses issues like partial occlusions and detector noise, resulting in a dataset with an approximate ratio of 18 negative examples to 25 positive examples. This balanced dataset supports the network's ability to generalize well across diverse conditions. For AssocNet[7], the association probabilities were trained using a similar strategy, with association masks generated to focus on regions of interest, minimizing computational overhead. The KITTI dataset's challenging test sequences allowed for rigorous benchmarking of FANTrack against state-of-the-art methods, demonstrating competitive performance and validating its robustness in real-world applications.

Architecture

FANTrack employs a sophisticated architecture that integrates convolutional neural networks (CNNs) into a tracking-by-detection[8] framework for online multi-object tracking (MOT). The core innovation lies in the Feature Association Network (FAN), which leverages a Siamese network to learn robust similarity functions between tracked targets and incoming detections. This network processes visual and 3D bounding box[7] data to create high-quality matching costs, making it resilient to challenges such as noisy detections, occlusions, and the time-varying number of targets.

The architecture consists of two primary components: SimNet and AssocNet. SimNet, a Siamese network[7], generates similarity scores by modeling pairwise relationships between targets and detections. It incorporates two branches: a bounding box branch for 3D object geometry and an appearance branch for visual features. These branches produce L2-normalized feature vectors, which are combined using learned importance weights to compute cosine similarity[7] scores. The similarity scores are compiled into local maps that represent the detection probabilities within each target's vicinity. AssocNet then processes these local maps using dilated convolutions[9] and fully connected layers to predict the assignment probabilities. This modular design leverages dilated convolutions to compensate for the sparsity in local similarity maps, enhancing accuracy. The overall framework ensures efficient and accurate data association by integrating spatial and visual cues. Additionally, FANTrack's reliance on convolutional and fully connected layers with regularization ensures generalizability across various scenarios, making it adaptable and easy to train.

Figure 3.1: Overall architecture of the FANTrack model[7]

Robust Multi-Modality Multi-Object Tracking

Dataset

The authors conducted extensive experiments to evaluate the performance of the mmMOT[10] framework, particularly on the KITTI benchmark, a widely used dataset for autonomous driving research. The KITTI dataset provides annotated data for object detection and tracking, including 3D and 2D bounding boxes for vehicles across various environmental conditions. For the purposes of their experiments, the authors used the tracking portion of the KITTI dataset, where objects are tracked across consecutive frames. This dataset is particularly challenging due to issues like occlusions, varying lighting conditions, and sensor failures. The authors split the dataset into a training set and a validation set, with a roughly equal distribution of frames across both sets. The tracking results on KITTI were reported in terms of MOTA (Multi-Object Tracking Accuracy), ID switches, false positives (FP), and false negatives (FN). These metrics were used to compare the performance of different versions of the mmMOT framework under varying conditions, including both full multi-sensor inputs and single-sensor failure scenarios. The KITTI dataset allowed the authors to demonstrate the robustness of the mmMOT framework, particularly under conditions where certain sensors (like the camera or LiDAR) failed or provided noisy data, highlighting the efficacy of the fusion mechanism and its ability to maintain high tracking accuracy across different modalities.

Architecture

The proposed multi-modality Multi-Object Tracking (mmMOT)[10] framework introduces a robust architecture designed to enhance the reliability and accuracy of tracking dynamic objects in autonomous driving systems. The framework follows a tracking-by-detection paradigm, where multiple sensor modalities are used to track objects over time. The core architecture consists of several key components. First, each sensor modality (image, LiDAR, etc.) independently extracts features using dedicated feature extractors, such as VGG-16[10] for images and PointNet[11] for point clouds. This loose coupling of modalities ensures high reliability in the presence of sensor failures, as each sensor operates independently. Next, these modality-specific features are fed into a fusion module that integrates the information across different modalities. To further refine the tracking performance, an adjacency estimator is employed to predict associations between object detections across consecutive frames. This estimator is designed to handle the correlations between different modalities using an efficient matrix learning approach. The entire framework is trained end-to-end, enabling joint optimization of both the feature extraction process and the modality fusion mechanism. One of the novel aspects of the architecture is the introduction of a deep representation of LiDAR point clouds in the data association process, a first in the field of multi-modality MOT. The robust fusion and adjacency estimation components of the framework are essential in improving tracking accuracy and ensuring that object trajectories are correctly associated, even in challenging conditions like sensor malfunction or low visibility.

3D Multi-Object Tracking: A Baseline and New Evaluation Metrics

Dataset

The report evaluates the proposed 3D multi-object tracking (MOT) system using two widely recognized datasets: KITTI and nuScenes. The KITTI dataset provides LiDAR point clouds and 3D bounding box trajectories, primarily focusing on autonomous driving scenarios. However, it only supports 2D MOT evaluation, so the authors developed a new evaluation tool to assess performance directly in 3D space. The KITTI validation set was used for evaluation, as the test set does not provide access to ground truth labels. For the nuScenes dataset, which features more complex scenes, sparser LiDAR data, and lower frame rates compared to KITTI, evaluations were conducted using the validation set, as the first 3D MOT challenge for nuScenes had not been finalized during the report's preparation. Results were reported across multiple categories, including cars, pedestrians, and cyclists in KITTI, and seven categories (e.g., car, truck, pedestrian) in nuScenes. Metrics such as Intersection over Union (IoU) and center distance thresholds were used to evaluate the accuracy of matches.

Architecture

The proposed 3D MOT system follows a modular and efficient pipeline designed for real-time performance. It starts with a 3D object detection module that extracts 3D bounding boxes from LiDAR point clouds using pre-trained detectors. A 3D Kalman filter with a constant velocity model predicts the state of tracked objects, including position, size, velocity, and orientation. Data association is performed using the Hungarian algorithm to match detections with predicted trajectories based on 3D IoU or center distance. The system then updates trajectory states through Bayesian filtering and applies an orientation correction technique to handle angular inconsistencies. A birth and death memory module manages object initialization and removal, ensuring robustness against false positives and missing detections. Notably, the system avoids neural networks, emphasizing simplicity and efficiency, achieving up to 207.4 FPS on KITTI without GPUs. This architecture highlights a balance between accuracy and computational efficiency, establishing a baseline for future 3D MOT systems.

Figure 3.2: Overall architecture of the AB3DMOT [12]
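For reference, the following is a minimal sketch of the kind of constant-velocity Kalman filter described above. The 10-dimensional state layout (position, yaw, box size, and linear velocity) follows the convention commonly used in AB3DMOT-style trackers, but the class name and noise settings here are illustrative assumptions rather than the tuned parameters of the original system.

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal 3D box tracker: state = [x, y, z, theta, l, w, h, vx, vy, vz]."""

    def __init__(self, initial_box):
        self.x = np.zeros(10)
        self.x[:7] = initial_box                 # first observation initializes the box
        self.P = np.eye(10) * 10.0               # state covariance (illustrative)
        self.F = np.eye(10)                      # constant-velocity transition matrix
        self.F[0, 7] = self.F[1, 8] = self.F[2, 9] = 1.0
        self.H = np.zeros((7, 10))               # only the 7 box parameters are observed
        self.H[:7, :7] = np.eye(7)
        self.Q = np.eye(10) * 0.01               # process noise (illustrative)
        self.R = np.eye(7) * 0.1                 # measurement noise (illustrative)

    def predict(self):
        # Propagate the state and covariance one frame ahead.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:7]

    def update(self, z):
        # z is the associated detection box [x, y, z, theta, l, w, h].
        y = z - self.H @ self.x                          # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(10) - K @ self.H) @ self.P
```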

3.4 EagerMOT: 3D Multi-Object Tracking via Sensor Fusion

Dataset

The EagerMOT [13] framework was evaluated on two prominent autonomous driving datasets: KITTI and nuScenes. The KITTI dataset provides high-resolution LiDAR data and 2D/3D object annotations, making it a standard for benchmarking 3D MOT systems. KITTI focuses on simpler urban environments with fewer occlusions and higher frame rates. Evaluations utilized the validation set, as the ground truth of the test set is not publicly accessible.

The nuScenes dataset offers a more challenging evaluation environment, featuring a broader array of object categories, more complex urban scenarios, and a significantly lower frame rate of 2 FPS. This dataset integrates multi-camera images with sparse LiDAR point clouds, enabling evaluation of sensor fusion capabilities. Metrics such as Average Multi-Object Tracking Accuracy (AMOTA), MOTA, and HOTA were used to evaluate detection, tracking, and association accuracy across categories like cars, pedestrians, and bicycles.

Architecture

The EagerMOT [13] system employs a sensor fusion approach, combining complementary information from 2D image-based detectors and 3D LiDAR detectors to achieve robust tracking. The architecture includes:

1. Detection and Fusion: At each frame, 2D bounding boxes from image detectors are matched with 3D bounding boxes from LiDAR-based detectors based on 2D IoU in the camera plane. This fusion produces unified object instances containing both 2D and 3D information.

2. Data Association: A two-stage association process matches detections to tracks (a simplified sketch of this matching is given after this list):

• Stage 1 uses 3D bounding box distance and orientation similarity for association.

• Stage 2 applies 2D IoU to associate unmatched instances from the first stage, enabling robust tracking despite partial observations.

3. State Updates: Object states, including 3D position, velocity, and orientation, are updated using a Kalman filter. Tracks can be updated with either 2D or 3D observations, ensuring resilience to sensor failures or occlusions.

4. Lifecycle Management: Tracks are initialized when new detections appear and terminated if no updates occur for a predefined number of frames. Tracks are confirmed based on sufficient temporal consistency in either 2D or 3D observations.
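The sketch below illustrates the spirit of the two-stage association described in step 2 with a simple greedy scheme: stage 1 matches on 3D center distance, and stage 2 falls back to 2D IoU for whatever remains. The thresholds, dictionary keys, and the greedy strategy are simplifying assumptions, not the exact EagerMOT procedure.

```python
import numpy as np

def iou_2d(a, b):
    """IoU of two 2D boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def two_stage_association(dets, tracks, max_dist=2.0, min_iou=0.3):
    """dets/tracks: lists of dicts with optional 'center_3d' and 'box_2d' entries."""
    matches, unmatched_dets = [], list(range(len(dets)))
    unmatched_trks = list(range(len(tracks)))

    # Stage 1: greedy matching on 3D center distance.
    for d in list(unmatched_dets):
        if dets[d].get("center_3d") is None:
            continue
        cands = [t for t in unmatched_trks if tracks[t].get("center_3d") is not None]
        if not cands:
            continue
        dists = [np.linalg.norm(np.asarray(dets[d]["center_3d"]) -
                                np.asarray(tracks[t]["center_3d"])) for t in cands]
        best = int(np.argmin(dists))
        if dists[best] < max_dist:
            matches.append((d, cands[best]))
            unmatched_dets.remove(d)
            unmatched_trks.remove(cands[best])

    # Stage 2: fall back to 2D IoU for the remaining detections and tracks.
    for d in list(unmatched_dets):
        if dets[d].get("box_2d") is None:
            continue
        cands = [t for t in unmatched_trks if tracks[t].get("box_2d") is not None]
        if not cands:
            continue
        ious = [iou_2d(dets[d]["box_2d"], tracks[t]["box_2d"]) for t in cands]
        best = int(np.argmax(ious))
        if ious[best] > min_iou:
            matches.append((d, cands[best]))
            unmatched_dets.remove(d)
            unmatched_trks.remove(cands[best])

    return matches, unmatched_dets, unmatched_trks
```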

Chapter 4 presents the proposed solution

Discussion

Below is a summary table of the car-tracking results for the models discussed in Chapter 3:

Table 4.1: 3D MOT models: performance metrics overview for the Car category

Model      sAMOTA   MOTA    MOTP    IDS
mmMOT      70.61    74.07   78.16   10
AB3DMOT    91.78    83.35   78.43   0

I have decided to choose the AB3DMOT model for further development due to its outstanding performance metrics. Specifically, AB3DMOT achieves a sAMOTA score of 91.78 and a MOTA score of 83.35, which are higher than those of other models such as mmMOT and FANTrack. Notably, AB3DMOT has an IDS score of 0, demonstrating its ability to accurately track targets without identity switches. Although its MOTP score (78.43) is only slightly higher than that of the other models, the balance between accuracy and stability makes AB3DMOT the ideal choice for further enhancement and optimization.

I plan to replace the Kalman Filter with the Particle Filter as the tracking method in my 3D MOT model to potentially achieve better performance and more reliable results. While the Kalman Filter is efficient and effective for scenarios with linear dynamics and Gaussian noise, it has limitations in handling the complexity and variability often encountered in real-world tracking environments.

In contrast, the Particle Filter offers greater flexibility by accommodating non-linear dynamics and non-Gaussian noise, making it more suitable for scenarios involving abrupt changes in motion, high accelerations, or rotations.

The Particle Filter's ability to maintain multiple hypotheses about an object's state allows it to handle challenges like occlusions and ambiguous detections more robustly, reducing the risk of losing tracks in crowded or cluttered scenes. Additionally, it supports richer state representations, enabling more accurate modeling of object parameters such as position, velocity, orientation, and potentially even object-specific dynamics. This adaptability is especially critical in applications like autonomous driving or robotics, where tracking conditions can vary significantly.

By leveraging these advantages, the Particle Filter is expected to improve the accuracy and reliability of the tracking results, particularly in scenarios with complex object motions and challenging environmental conditions. These improvements are likely to be reflected in enhanced performance metrics, such as MOTP and MOTA, making the Particle Filter a more effective choice for my 3D MOT model.
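As a concrete reference for the proposed change, the snippet below is a minimal single-object particle filter with a constant-velocity motion model and a Gaussian likelihood on the observed 3D box center. The particle count and all noise levels are illustrative assumptions; the full tracker would also carry box size and orientation in the state.

```python
import numpy as np

class ParticleFilter3D:
    """Minimal particle filter over [x, y, z, vx, vy, vz] for one tracked object."""

    def __init__(self, init_center, n_particles=500, pos_std=0.5, vel_std=0.5):
        self.n = n_particles
        self.particles = np.zeros((self.n, 6))
        self.particles[:, :3] = init_center + np.random.randn(self.n, 3) * pos_std
        self.particles[:, 3:] = np.random.randn(self.n, 3) * vel_std
        self.weights = np.full(self.n, 1.0 / self.n)

    def predict(self, dt=0.1, process_std=0.2):
        # Constant-velocity motion plus process noise.
        self.particles[:, :3] += self.particles[:, 3:] * dt
        self.particles += np.random.randn(self.n, 6) * process_std

    def update(self, observed_center, meas_std=0.5):
        # Weight particles by a Gaussian likelihood of the observed box center.
        d = np.linalg.norm(self.particles[:, :3] - observed_center, axis=1)
        self.weights = np.exp(-0.5 * (d / meas_std) ** 2) + 1e-12
        self.weights /= self.weights.sum()

    def resample(self):
        # Multinomial resampling concentrates particles on high-likelihood states.
        idx = np.random.choice(self.n, size=self.n, p=self.weights)
        self.particles = self.particles[idx]
        self.weights.fill(1.0 / self.n)

    def estimate(self):
        # Weighted mean state reported as the track position and velocity.
        return np.average(self.particles, axis=0, weights=self.weights)
```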

Architecture

Figure 4.1: Architecture of proposed model

The proposed model architecture is designed to address the challenge of 3D object detection and tracking in dynamic environments. It is composed of a modular pipeline that integrates data preprocessing, 3D object detection, and tracking using a particle filter approach. The architecture, as illustrated in the diagram, consists of five primary stages: Raw Data, Data Preprocessing, 3D Object Detection (PointRCNN), Particle Filter, and Result.

The first stage, Raw Data, represents the input collected from sensors. These raw data points contain unstructured and noisy information, which cannot be directly used for object detection and tracking. Therefore, the next stage focuses on preparing this data for subsequent processing.

Data Preprocessing plays a crucial role in normalizing and structuring the raw data. This stage involves point cloud processing, coordinate transformations, and filtering techniques to remove noise and irrelevant points. It ensures that the data input to the detection model is well-organized and optimized for performance.

In the third stage, 3D Object Detection (PointRCNN) is employed to detect objects in the preprocessed data. PointRCNN, a state-of-the-art detection algorithm, uses raw point cloud data as input and generates 3D bounding boxes for detected objects. It leverages a two-stage architecture where the first stage proposes object regions, and the second stage refines these proposals. PointRCNN is particularly effective for 3D object detection due to its ability to process sparse point clouds directly without voxelization, preserving spatial information and improving detection accuracy.

The fourth stage, Particle Filter, is utilized for tracking detected objects over time. This probabilistic approach estimates the state of each object by maintaining a set of particles representing possible states. It recursively updates these particles based on observations and motion models, enabling robust tracking even in the presence of occlusions and sensor noise. The particle filter also helps to handle nonlinear motion patterns and uncertainty effectively.

Finally, the Result stage outputs the processed data, including tracked objects and their trajectories. This information can be used for higher-level applications such as autonomous driving, robotic navigation, and surveillance systems. The modular nature of the architecture allows flexibility for integration with other models or techniques, making it adaptable to various scenarios and datasets.

In summary, this architecture combines the strengths of PointRCNN for accurate 3D object detection with the robustness of particle filters for reliable tracking.

By systematically processing raw data and leveraging advanced algorithms, it provides a comprehensive solution for 3D object detection and tracking tasks.

Data preprocessing

Detection Data Preparation

The first step in the data preprocessing pipeline is to load the raw detection data for each frame. This data typically consists of 2D and 3D bounding boxes, object class labels, and confidence scores. These detections are usually produced by a separate detection model, such as PointRCNN or Megvii, which processes the raw sensor data (e.g., LiDAR or cameras) to identify objects within the scene.

Once the detection data is loaded, it is filtered to ensure only valid detections are included. Detections with missing information or incomplete data (such as empty bounding boxes or missing confidence scores) are discarded. This ensures that the subsequent tracking process operates on clean, reliable inputs. Each frame's data is processed individually, with relevant features like object orientation and the 3D bounding box parameters being extracted for further use in the tracking pipeline.
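The snippet below sketches this loading and filtering step, assuming a simple comma-separated text format with one detection per line (frame index, class id, confidence score, and the seven 3D box parameters). The column layout, file name, and score threshold are assumptions for illustration, not the exact format produced by PointRCNN.

```python
import numpy as np

def load_detections(path, score_threshold=0.5):
    """Load detections and drop incomplete or low-confidence rows.

    Assumed columns: frame, class_id, score, h, w, l, x, y, z, rotation_y
    """
    data = np.loadtxt(path, delimiter=",", ndmin=2)
    if data.size == 0:
        return np.empty((0, 10))
    # Drop rows with missing values or low confidence scores.
    valid = ~np.isnan(data).any(axis=1) & (data[:, 2] >= score_threshold)
    return data[valid]

# Example usage (hypothetical file name):
# dets = load_detections("data/0001_pointrcnn_car.txt")
```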

Handling Frame-specific Data

In the context of 3D Multiple Object Tracking (MOT), each frame in a sequence may contain multiple objects, each with varying levels of detection quality. Some objects may have been detected clearly, while others might have lower confidence scores or may even be missed entirely. The preprocessing system addresses these challenges by organizing the detections based on the frame number, ensuring that the data for each frame is consistent and properly formatted.

Additionally, each detection includes supplementary information that is crucial for tracking. For example, object orientation (which describes the rotation of the bounding box), as well as other metadata such as the object's class (car, pedestrian, etc.), are extracted and stored for each detection. This helps the tracking system to better associate detections across frames and maintain accurate object identities throughout the sequence.
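A small sketch of this per-frame organization, under the same assumed column layout as in the loading example above:

```python
from collections import defaultdict

def group_by_frame(dets):
    """Group detection rows by integer frame index (column 0 in the assumed layout)."""
    frames = defaultdict(list)
    for row in dets:
        frames[int(row[0])].append({
            "class_id": int(row[1]),
            "score": float(row[2]),
            "box_3d": row[3:10],      # h, w, l, x, y, z, rotation_y
        })
    return frames
```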

Setting Up Directories and Paths

To streamline the data management process, the system sets up a well-organized directory structure. This structure includes separate directories for saving tracking outputs, affinity matrices, and debugging files. The tracking results are stored in files that correspond to the specific sequence, frame, and hypothesis number. Affinity matrices, which describe the relationships between objects detected in consecutive frames, are also saved in dedicated directories. These matrices are used to evaluate the quality of object associations during the tracking process.

By establishing this organized directory structure, the system ensures that the data is easy to access and manage throughout the tracking pipeline. It also facilitates the evaluation and comparison of results across different object categories and hypotheses.
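A minimal sketch of such a layout is shown below; the directory names and the per-hypothesis folder convention are assumptions, not the exact structure used in the reference implementation.

```python
import os

def setup_output_dirs(root, sequence, hypothesis=0):
    """Create per-sequence folders for tracking results, affinity matrices, and debug dumps."""
    dirs = {
        "tracking": os.path.join(root, "results", sequence, f"hypo_{hypothesis}"),
        "affinity": os.path.join(root, "affinity", sequence),
        "debug": os.path.join(root, "debug", sequence),
    }
    for path in dirs.values():
        os.makedirs(path, exist_ok=True)
    return dirs
```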

Tracker Initialization

Once the detection data is prepared, the tracking system is initialized. This initialization phase involves loading several important components, such as calibration data, ego-motion information (e.g., from IMU or GPS sensors), and image data for visualization purposes. The calibration data is crucial because it allows the tracker to convert between different sensor modalities, ensuring that object locations can be accurately represented in the 3D space. Ego-motion data is used to account for the movement of the vehicle or sensor platform, helping to correct for motion-related distortions when associating objects across frames.

With these components loaded and the environment set up, the tracker is ready to begin associating objects across frames, handling object occlusions, and managing object identities throughout the sequence.

Tracking and Saving Results

With the tracker initialized, the system begins processing each frame in the sequence. For each frame, the tracker matches new detections to existing tracks, creating new tracks for newly detected objects and updating the states of existing tracks for previously detected objects. The tracking process involves both associating detections across frames and handling complex scenarios, such as object occlusions or missed detections.

As the tracker operates, it produces several outputs. The primary outputs are the 2D bounding boxes, 3D positions, and orientations for each object. These tracking results are saved in standard formats that can be used for further evaluation, such as the 3D MOT challenge format. Additionally, the system saves the affinity matrices that capture the strength of associations between objects across consecutive frames. These matrices are important for assessing the consistency of object tracking and are used in the evaluation phase to measure tracking performance.
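The per-frame loop can be summarized as below. Here `tracker` stands for any tracker exposing an `update(detections)` method (for example the particle-filter sketch shown earlier), and the output columns follow a simplified KITTI-like layout rather than the exact challenge format; the dictionary keys are assumptions.

```python
def run_sequence(tracker, frames, out_path):
    """Run tracking over one sequence and save one line per tracked object per frame."""
    with open(out_path, "w") as f:
        for frame_id in sorted(frames):
            tracks = tracker.update(frames[frame_id])   # associate, update, create, delete
            for trk in tracks:
                h, w, l, x, y, z, ry = trk["box_3d"]
                f.write(f"{frame_id} {trk['track_id']} {trk['class_id']} "
                        f"{h:.4f} {w:.4f} {l:.4f} {x:.4f} {y:.4f} {z:.4f} {ry:.4f} "
                        f"{trk['score']:.4f}\n")
```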

Post-Processing and Result Integration

After the tracking phase is completed for each sequence, the results for different object categories (e.g., cars, pedestrians, cyclists) are merged into a unified tracking output. This post-processing step ensures that the results are organized consistently across object types and frames, facilitating the comparison and evaluation of tracking performance. The post-processing also includes combining results from different hypotheses, if multiple tracking hypotheses were used.

Once the results are combined, they are ready for further analysis and evaluation, including performance metrics like MOTA (Multiple Object Tracking Accuracy) and other relevant tracking metrics. This final step ensures that the tracking outputs are in a format suitable for benchmarking against other tracking methods and for assessing the overall effectiveness of the tracking system.
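A possible merging step, assuming each category writes its own result file in the same simplified layout used above:

```python
def merge_category_results(category_files, merged_path):
    """Concatenate per-category tracking files and sort lines by frame index."""
    lines = []
    for path in category_files:
        with open(path) as f:
            lines.extend(line.strip() for line in f if line.strip())
    # Column 0 is the frame index in the assumed layout.
    lines.sort(key=lambda line: int(line.split()[0]))
    with open(merged_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```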

Tracking Algorithms in 3D MOT

Tracking-by-Detection

Tracking-by-Detection is one of the most commonly used approaches in 3D Multiple Object Tracking (3D MOT), particularly in scenarios where detections come from a pre-trained object detection model like PointRCNN. This approach involves associating each detection in a frame with an existing track, or creating new tracks for newly detected objects.

• After running the PointRCNN model, you get a list of 3D bounding boxes and their associated object class labels for every object detected in a particular frame. These objects might belong to different categories such as cars, pedestrians, cyclists, etc.

• The tracking system must then associate each new detection in the current frame to an existing track. If the detection matches an existing object, it will be linked to that track; otherwise, a new track will be created.

• Data association refers to the task of correctly linking a detection from the current frame to a specific object being tracked across previous frames. This is crucial for maintaining consistent object identities.

• In many cases, the Hungarian algorithm or Global Nearest Neighbor (GNN) approaches are used to solve the association problem. These algorithms compute a cost matrix where the cost represents how likely a detection matches a given track. The cost is typically calculated based on:

– Spatial proximity: How close the center of the 3D bounding box of the detection is to the predicted center of the tracked object.

– Motion similarity: How well the detection’s velocity or direction matches the predicted trajectory of the object based on its past movements.

• The Hungarian algorithm solves this as an optimization problem by finding the best matching of detections to tracks, minimizing the total cost (a short SciPy-based sketch of this step follows this list).

• Once a detection is associated with a track, the tracker updates the object's state. The object's state could include its location, velocity, orientation, and dimensions.

• The update is done by adjusting the track's 3D bounding box parameters based on the new detection. For example, if a car's position has moved from one frame to the next, the track's bounding box is updated accordingly.

• Track creation happens when a detection does not match any existing track in a frame. This occurs when a new object appears in the scene or when an object is detected after a brief occlusion.

• Track deletion happens when a track is not associated with any detection across multiple frames (for example, if the object has left the scene or is occluded for too long). When this occurs, the tracker deletes the track from the system to avoid cluttering the memory with non-existent objects.
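The sketch below shows how such a cost matrix can be solved with the Hungarian algorithm as implemented in SciPy. Here the cost is simply the 3D center distance between detections and predicted tracks, and the gating threshold is an illustrative assumption; a motion-similarity term could be added to the cost in the same way.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(det_centers, trk_centers, max_dist=3.0):
    """Match detections to tracks by minimizing total 3D center distance.

    det_centers: (N, 3) array of detection centers; trk_centers: (M, 3) array of predicted track centers.
    """
    if len(det_centers) == 0 or len(trk_centers) == 0:
        return [], list(range(len(det_centers))), list(range(len(trk_centers)))

    # cost[i, j] = Euclidean distance between detection i and predicted track j.
    cost = np.linalg.norm(det_centers[:, None, :] - trk_centers[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)

    # Keep only assignments that pass the distance gate.
    matches = [(d, t) for d, t in zip(rows, cols) if cost[d, t] < max_dist]
    matched_d = {d for d, _ in matches}
    matched_t = {t for _, t in matches}
    unmatched_dets = [d for d in range(len(det_centers)) if d not in matched_d]
    unmatched_trks = [t for t in range(len(trk_centers)) if t not in matched_t]
    return matches, unmatched_dets, unmatched_trks
```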

Tracking-by-detection is effective because it separates detection and tracking into two independent tasks. However, it can struggle with issues such as occlusions (where an object is temporarily hidden by another object) and identity switches (where the tracker mistakenly swaps the identities of two objects, usually in cases of close proximity).

Results


References

[1] S. Shi, X. Wang, and H. Li, "PointRCNN: 3D object proposal generation and detection from point cloud", in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, June 16-20, 2019, pp. 770-779.

[2] Y. Aktas and G. Ferret, "3D multi-object tracking using LiDAR for autonomous driving", Kitware Blog, July 21, 2022.

[5] K. N. Haque, "What is convolutional neural network — CNN (deep learning)", LinkedIn, April 3, 2023.

[6] S. A. K. Mohammed, M. Z. A. Razak, A. H. A. Rahman, and M. A. Bakar, "An efficient intersection over union algorithm for 3D object detection", IEEE Access, vol. 12, pp. 169768-169786, 2024.

[7] E. Baser, V. Balasubramanian, P. Bhattacharyya, and K. Czarnecki, "FANTrack: 3D multi-object tracking with feature association network", in 2019 IEEE Intelligent Vehicles Symposium (IV 2019), Paris, France, June 9-12, 2019, pp. 1426-1433.

[8] D. Frossard and R. Urtasun, "End-to-end learning of multi-sensor 3D tracking by detection", in 2018 IEEE International Conference on Robotics and Automation (ICRA 2018), Brisbane, Australia, May 21-25, 2018, pp. 635-642.

[9] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions", in 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico, May 2-4, 2016.

[10] W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy, "Robust multi-modality multi-object tracking", CoRR, vol. abs/1909.03850, 2019. arXiv: 1909.03850.

[11] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation", in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, July 21-26, 2017, pp. 77-85.

[12] X. Weng, J. Wang, D. Held, and K. Kitani, "AB3DMOT: A baseline for 3D multi-object tracking and new evaluation metrics", CoRR, vol. abs/2008.08063, 2020. arXiv: 2008.08063.

[13] A. Kim, A. Osep, and L. Leal-Taixé, "EagerMOT: 3D multi-object tracking via sensor fusion", in IEEE International Conference on Robotics and Automation (ICRA 2021), Xi'an, China, May 30 - June 5, 2021, pp. 11315-11321.

[14] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite", in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, USA, June 16-21, 2012, pp. 3354-3361.