
Vehicle detection, tracking and behavior analysis with enhancing depth information


DOCUMENT INFORMATION

Basic information

Title: Vehicle Detection, Tracking and Behavior Analysis with Enhancing Depth Information
Authors: Ha Phan Ngoc Quan, Truong Thanh Nguyen
Advisor: Nguyen Trung Hieu, MSc.
Institution: Ho Chi Minh City University of Technology and Education
Major: Automotive Engineering
Document type: Graduation Project
Year: 2023
City: Ho Chi Minh City
Format
Pages: 71
File size: 6.73 MB


Structure

  • Chapter 1: INTRODUCTION
    • 1.1. Topic Reasoning
    • 1.2. Objectives
    • 1.3. Project Content
      • 1.3.1. Research and Literature Review
      • 1.3.2. Data Collection and Preprocessing
      • 1.3.3. Vehicle Detection and Driving Scene Recognition
      • 1.3.4. Vehicle Tracking
      • 1.3.5. Behavior Analysis and Warning
      • 1.3.6. Warning and Alert System
      • 1.3.7. Documentation and Reporting
      • 1.3.8. Iterative Development and Enhancement
    • 1.4. Research Method
      • 1.4.1. Vehicle Detection
      • 1.4.2. Vehicle Tracking
      • 1.4.3. Behavior Analysis
    • 1.5. Limitations
      • 1.5.1. Hardware Constraints
      • 1.5.2. Processing Speed
      • 1.5.3. False Positives and False Negatives
      • 1.5.4. Environmental Variability
      • 1.5.5. Generalization to Unseen Scenarios
      • 1.5.6. Human Factors and Driver Interaction
  • Chapter 2: LITERATURE REVIEW
    • 2.1. Data Fusion Mechanism
      • 2.1.1. Stereo Disparity Block Matching
      • 2.1.2. Distance Estimation based on Depth Information
    • 2.2. Multi-tasking Detection Model
      • 2.2.1. Challenges
      • 2.2.2. Dataset
      • 2.2.3. Evaluation Metrics
      • 2.2.4. Related Work
    • 2.3. Vehicle Tracking
      • 2.3.1. Data Association
      • 2.3.2. Motion-based Tracking
      • 2.3.3. Feature-based Tracking
      • 2.3.4. Dataset
      • 2.3.5. Evaluation Metrics
      • 2.3.6. Related Work
    • 2.4. Vehicle Behavior Analysis/Trajectory Prediction
      • 2.4.1. Challenges
      • 2.4.2. Behavior Analysis Problem Statement
      • 2.4.3. Dataset
      • 2.4.4. Evaluation Metrics and Losses
      • 2.4.5. Related Work
  • Chapter 3: MULTI-TASKING MODEL FOR PANOPTIC DRIVING PERCEPTION
    • 3.1. System overview
    • 3.2. Network Model Selection
    • 3.3. Network Architecture
      • 3.3.1. Shared encoder
      • 3.3.2. Task heads
  • Chapter 4: VEHICLE TRACKING
    • 4.1. System Overview
    • 4.2. Feature-based Tracker
    • 4.3. Data Association and Track Management
      • 4.3.1. Data Association
      • 4.3.2. Track Management
  • Chapter 5: VEHICLE BEHAVIOR ANALYSIS AND DRIVER WARNING
    • 5.1. System Overview
    • 5.2. Post Processing of Multitasking Detection Model output
      • 5.2.1. Hough transform
      • 5.2.2. Line merging and filtering
      • 5.2.3. Lane extrapolation
      • 5.2.4. Establish Warning Region of Interest
    • 5.3. Predict Future State using Kalman Filter
    • 5.4. Analyzing and Warning
      • 5.4.1. Case 1: Warning of Decelerating
      • 5.4.2. Warning of Lane-changing
      • 5.4.3. Warning at Intersection
  • Chapter 6: RESULTS
    • 6.1. Vehicle Tracking
    • 6.2. Behavior Analysis and Warning
  • Chapter 7: CONCLUSION AND FUTURE WORK

Content

INTRODUCTION

Topic Reasoning

As automotive technology evolves, the integration of Advanced Driver Assistance Systems (ADAS) is crucial for achieving safer and more efficient transportation. ADAS applications are transforming vehicle interactions with their environment, offering drivers enhanced safety features and intelligent support. Designed to reduce risks and accidents, these systems significantly improve the overall driving experience. The growing need for ADAS stems from the urgent demand to enhance road safety, optimize vehicle performance, and facilitate the transition towards autonomous driving.

ADAS applications are primarily driven by the goal of enhancing road safety, addressing the significant risk of accidents caused by human error, which accounts for 41.0% of severe crashes. These technologies utilize sensors, cameras, and algorithms for real-time monitoring and hazard detection, providing warnings to drivers and enabling autonomous interventions when needed. By improving human perception and reaction times, ADAS aims to prevent accidents, reduce their severity, and ultimately save lives.

The analysis of vehicle behavior is essential for enhancing Advanced Driver Assistance Systems (ADAS) applications, as it ensures safety and efficiency on the road. Vehicle behavior analysis examines the interactions among vehicles, road conditions, and drivers, utilizing ADAS technologies to collect real-world data and assess driving patterns. This information is crucial for improving ADAS algorithms, enabling more precise risk detection and effective interventions. Ultimately, the collaboration between ADAS applications and vehicle behavior analysis fosters a safer and smarter driving environment for all road users.

Our framework for vehicle detection, tracking, and behavior analysis is a comprehensive solution that addresses the demands for enhanced road safety and intelligent driver assistance.

Our framework utilizes cutting-edge computer vision and machine learning to effectively detect and track vehicles in real-time, while also analyzing their behaviors. This technology empowers us to deliver prompt alerts to drivers regarding any unusual actions displayed by other vehicles on the road.

Our integrated system combines vehicle detection, tracking, and behavior analysis to enhance drivers' situational awareness and safety on the road. By identifying and alerting drivers to abnormal behaviors of surrounding vehicles, we significantly reduce accident risks and promote overall road safety.

Objectives

The project aims to create a sophisticated framework for vehicle detection, tracking, and behavior analysis to improve road safety and driver assistance. By integrating data from multiple sensors and utilizing advanced computer vision and machine learning techniques, the framework will accurately detect and track vehicles in real-time while analyzing their behavior in traffic scenarios to identify abnormal patterns. This initiative seeks to provide timely warnings and alerts to drivers about potential risks and hazards posed by other vehicles on the road.

The project seeks to improve road safety through advanced algorithms for comprehensive driving perception, emphasizing the precise identification of vehicles in diverse environmental conditions. This foundational work is essential for thorough analysis, while the development of innovative and resilient tracking algorithms ensures continuous vehicle tracking across frames, even in difficult situations like occlusions and intricate traffic patterns.

Overall, the aim of the project is to:

• Perform multi-sensor data combination, i.e., stereo camera, to augment information for subsequent behavior analysis;

• Integrate a multitasking model for panoptic driving perception to enable vehicle detection and scene recognition;

• Design a novel and reliable feature-based tracking framework that fully exploits distinct vehicle appearance and outperforms classic motion-based trackers;

• Integrate panoptic driving perception, vehicle tracking, and behavior analysis to enhance road safety by providing intelligent driver assistance.

Project Content

The project content for the development of the vehicle detection, tracking, and behavior analysis framework can be organized into several key components:

1.3.1 Research and Literature Review

• Conduct a comprehensive review of existing research and literature on vehicle detection, tracking, and behavior analysis

• Explore state-of-the-art computer vision and machine learning techniques applicable to the project's objectives

• Identify relevant algorithms, methodologies, and datasets for training and evaluation

1.3.2 Data Collection and Preprocessing

• Gather video streams or camera feeds capturing real-world traffic scenarios

• Preprocess the data by calibrating cameras, performing image enhancement, and ensuring data quality

• Annotate the data to create ground truth labels for training and evaluation purposes

1.3.3 Vehicle Detection and Driving Scene Recognition

• Implement and integrate advanced object detection and panoptic driving perception algorithm to accurately detect vehicles and recognize traffic scenes in the video streams

• Optimize the detection algorithm for real-time processing to ensure efficient performance

1.3.4 Vehicle Tracking

• Develop tracking algorithms (e.g., Kalman filters, particle filters, deep association methods) to robustly track vehicles across frames

• Utilize the detection results and incorporate motion models to maintain tracking consistency

• Handle challenges such as occlusions, varying lighting conditions, and complex traffic scenarios to ensure accurate and reliable tracking

1.3.5 Behavior Analysis and Warning

• Define and extract relevant features (e.g., speed, acceleration, lane changes, proximity) to characterize vehicle behavior

• Develop algorithms to analyze and interpret the extracted features, identifying normal and abnormal behavior patterns

• Design rules to classify and flag abnormal behaviors that may pose risks or hazards

1.3.6 Warning and Alert System

• Integrate the behavior analysis module with a warning and alert system

• Implement a user-friendly interface to provide timely notifications to the driver about detected abnormal vehicle behaviors

1.3.7 Documentation and Reporting

• Document the entire development process, including methodologies, algorithms, and implementation details

• Prepare comprehensive reports outlining the project objectives, methodology, results, and future recommendations

• Create user guides or technical documentation for the deployed framework to facilitate its utilization and maintenance

1.3.8 Iterative Development and Enhancement

• Continuously refine and enhance the framework based on user feedback, emerging research, and advancements in the field

• Consider scalability and adaptability to different environments, sensors, and vehicle types for future expansion and integration

The team can systematically create a comprehensive framework for vehicle detection, tracking, and behavior analysis, aimed at improving road safety and providing intelligent driver assistance.

Research Method

1.4.1 Vehicle Detection

For the vehicle detection component, the research methodology employed a combination of existing datasets and pretrained models. The following approach was adopted:

1) Dataset Selection: To train and evaluate the vehicle detection model, publicly available datasets were utilized, including the KITTI [2] and BDD100K [3] datasets. These datasets offer diverse real-world traffic scenarios, which are essential for training a robust detection model.

2) Pretrained Model Integration: A pretrained object detection model was selected as the foundation for the detection component. This choice was made to leverage the advancements achieved by these models in detecting objects, including vehicles, in images or video streams.

3) Fine-tuning and Post-Processing: The selected pretrained model was fine-tuned on the specific dataset used in this project. Fine-tuning involved training the model using annotated data from the chosen dataset to adapt it to the specific requirements and challenges of vehicle detection in the target environment.

1.4.2 Vehicle Tracking

A new tracking framework has been created to ensure precise and dependable vehicle tracking across multiple frames. The methodology for this vehicle tracking research is clearly defined and structured.

1) Framework Design: The tracking framework was designed based on well-established tracking methodologies, including Kalman filters, particle filters, and deep association networks. These methods were chosen for their proven effectiveness in maintaining tracking consistency, even in challenging scenarios such as occlusions, varying lighting conditions, and complex traffic patterns.

2) Implementation: The tracking framework was implemented using appropriate programming languages (e.g., Python) and relevant computer vision libraries (e.g., OpenCV). The framework incorporated motion models and optimization techniques to ensure robust tracking performance.

3) Evaluation and Refinement: The developed tracking framework was evaluated using representative datasets and real-world scenarios. Performance metrics such as accuracy, precision, recall, and F1 score were utilized to assess its effectiveness. Based on the evaluation results, necessary refinements and optimizations were made to enhance the tracking accuracy and robustness.

1.4.3 Behavior Analysis

The behavior analysis component involved developing a custom method for characterizing and identifying abnormal behaviors exhibited by vehicles. The research methodology for behavior analysis is as follows:

1) Feature Extraction: Relevant features, such as speed, acceleration, lane changes, and proximity, were defined and extracted from the tracked vehicle data. These features were chosen based on their significance in capturing behavioral patterns.

2) Algorithm Design: Algorithms and rules were developed to analyze and interpret the extracted features. This involved defining thresholds or developing machine learning models to classify and flag abnormal behavior patterns. The algorithms were designed to detect behaviors such as sudden lane changes, aggressive acceleration or braking, or frequent tailgating; a minimal illustrative sketch of such rules follows this list.

3) Evaluation and Validation: The behavior analysis method was evaluated using real-world datasets and scenarios. The effectiveness of the algorithm in detecting abnormal behaviors was assessed based on performance metrics, qualitative analysis, and comparisons with ground truth annotations. The method was iteratively refined and validated to enhance its accuracy and reliability.
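As referenced in the algorithm design step above, the following is a minimal, purely illustrative sketch of rule-based behavior flagging from a tracked trajectory; the thresholds, time step, and the assumption that x is the lateral offset in metres are placeholders, not values taken from the thesis:

import numpy as np

def flag_abnormal(track_xy, dt=0.1, accel_limit=4.0, lane_width=3.5):
    """track_xy: (N, 2) array of ground-plane positions in metres, oldest sample first."""
    xy = np.asarray(track_xy, dtype=float)
    vel = np.diff(xy, axis=0) / dt                    # per-step velocity vectors
    speed = np.linalg.norm(vel, axis=1)               # speed in m/s
    accel = np.diff(speed) / dt                       # longitudinal acceleration proxy
    flags = []
    if accel.size and accel.min() < -accel_limit:     # hard braking
        flags.append("sudden_braking")
    if accel.size and accel.max() > accel_limit:      # aggressive acceleration
        flags.append("aggressive_acceleration")
    if abs(xy[-1, 0] - xy[0, 0]) > lane_width / 2:    # crude lane-change cue
        flags.append("lane_change")
    return flags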

The research methodology employed a comprehensive and customized strategy for vehicle detection, tracking, and behavior analysis. By integrating existing datasets, pretrained models, innovative tracking frameworks, and bespoke behavior analysis techniques, the project sought to provide precise and dependable outcomes to improve road safety and enhance driver assistance.

Limitations

1.5.1 Hardware Constraints

The advanced framework for vehicle detection, tracking, and behavior analysis requires significant computational resources and processing power, which may surpass the capabilities of standard embedded systems or low-power devices. This challenge restricts the feasible integration of the framework into practical hardware platforms, making real-world deployment difficult.

1.5.2 Processing Speed

The framework's real-time performance can be hindered by the high computational demands of detection and tracking algorithms. This may lead to suboptimal processing speeds, particularly when managing high-resolution video streams or fast-changing traffic situations, resulting in delays that affect the system's overall effectiveness.

1.5.3 False Positives and False Negatives

The abnormal behavior detection warning system can experience occasional false positives and false negatives, leading to misleading alerts for drivers and causing confusion or unnecessary actions. To improve the reliability of this warning system, it is essential to further refine and fine-tune the behavior analysis algorithms, thereby reducing inaccuracies.

1.5.4 Environmental Variability

Environmental factors like fluctuating lighting, harsh weather, and complex traffic situations can significantly impact the framework's performance. These conditions may lead to uncertainties that affect the accuracy and reliability of detection, tracking, and behavior analysis, potentially causing false positives or missed detections.

1.5.5 Generalization to Unseen Scenarios

The developed framework may face challenges in handling new traffic scenarios not sufficiently covered in the training datasets, leading to potential inaccuracies in vehicle detection, tracking, and behavior analysis. To improve its generalization capabilities, further research and enhanced data collection efforts are essential.

1.5.6 Human Factors and Driver Interaction

The effectiveness of a warning system is largely dependent on how drivers perceive and interpret its signals. Factors like distractions and inattentiveness can hinder a driver's response to warnings, ultimately compromising road safety. To enhance the system's impact, it is crucial to provide driver education and training focused on interpreting and responding to these warnings effectively.

The thesis report offers a balanced view of the project's challenges and constraints by acknowledging its limitations. It also emphasizes the need for further research and development to address these issues, ultimately aiming to improve the framework's practicality and effectiveness.

LITERATURE REVIEW

Data Fusion Mechanism

The authors in [4] utilized Lidar technology to effectively detect, track, and classify pedestrians and vehicles within its frame of reference. They transformed the detected object positions into the image frame to pinpoint the region of interest for the vision classifier. To enhance classification accuracy, a sum decision rule was applied, integrating results from both the Lidar, which employed a Gaussian Mixture Model (GMM) classifier, and the vision system, which utilized an AdaBoost classifier.

In [5], the authors utilized Lidar as the main sensor in their sensor fusion system, focusing on detecting moving objects. They achieved this by examining discrepancies between free and occupied cells in the map; the detection of an occupied cell in a previously free area signifies the presence of a moving object.

The output of the Lidar detection system is then utilized as input for image classification using the Histogram of Oriented Gradients (HOG) descriptor [6].

Cho et al. [7] proposed an innovative tracking system that consists of two main components: the sensor layer and the fusion layer. The sensor layer independently detects and classifies data from each sensor, while the fusion layer integrates these results to generate the final outcome. This layered architecture effectively separates the sensing hardware from the data processing involved in detection and tracking.

Lidar technology was utilized for detecting obstacles, complemented by a monocular camera for classification purposes. Higher-level sensor fusion was implemented, focusing on detection certainty and the accuracy of the involved sensors.

Our data fusion approach prioritizes stereo cameras over Lidar, leveraging the latest advancements in computer vision and deep learning for real-time, high-accuracy object detection. By incorporating depth images into our behavior analysis algorithm, we enhance sensory data association and improve distance computation, ultimately enriching behavior analysis through enhanced depth information.

2.1.1 Stereo Disparity Block Matching

Distance estimation is essential for applications like Automated Driving and Robotics, and a cost-effective method for achieving this is through stereo camera vision. By using a stereo camera setup, depth information can be derived from point correspondences via triangulation, allowing for the calculation of depth at specific points based on known disparities. Disparity, the difference in position of a point between two stereo images, indicates proximity, with greater disparity values signifying closer objects. Analyzing these disparities enables accurate distance estimation, enhancing depth perception for autonomous vehicles and robotic systems.

Disparity estimation algorithms are primarily classified into two categories: local methods and global methods. Local methods assess the disparity of individual pixels by analyzing only their neighboring pixels, while global methods utilize information from the entire image for a more comprehensive estimation process.

Disparity levels are crucial in stereo vision algorithms for accurate depth estimation, as they define the search space for matching pixels between left and right images. In the disparity map, each pixel in the left image is assigned a disparity value that indicates the horizontal displacement to its corresponding pixel in the right image.

The disparity levels, represented by D, influence the granularity of the search space in depth estimation. A higher D value results in more disparity levels, leading to a finer search space that enables the detection of smaller depth differences for more accurate depth estimation. However, it is important to note that increasing the disparity levels also raises the computational demands of the algorithm.

In the provided figure, 'w' denotes the image width and 'h' indicates the height. The initial D columns of the left image remain unutilized due to the absence of corresponding pixels in the right image for comparison. This process results in the creation of a disparity map, with disparity levels generated for each pixel in the left image.

Fig 1 Pixel level disparity between left and right image

The selection of disparity levels is influenced by the required minimum distance for depth detection and the available computational resources. By increasing the number of disparity levels, the minimum detectable distance decreases, enabling the detection of objects at closer ranges. However, this enhancement leads to greater computational complexity.

The resolution of input images significantly influences depth estimation, with higher resolutions enhancing accuracy while also raising the minimum detectable distance. Typically, the number of disparity levels correlates with the input image resolution for identifying objects at the same depth. Our approach employs the Semi-Global Block Matching (SGBM) algorithm, utilizing 64 disparity levels, as illustrated in Fig 2.
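For concreteness, the following is a minimal OpenCV sketch of the SGBM disparity step described above; the 64 disparity levels follow the text, while the block size, smoothness penalties, and file names are illustrative assumptions rather than the project's actual settings:

import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # rectified left image (placeholder path)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # rectified right image (placeholder path)

sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,      # D = 64 disparity levels, as stated in the text
    blockSize=9,            # matching block size (assumed)
    P1=8 * 9 * 9,           # penalty on small disparity changes between neighbours
    P2=32 * 9 * 9,          # penalty on large disparity changes (P2 > P1)
    uniquenessRatio=10,
)

# OpenCV returns a fixed-point disparity map scaled by 16
disparity = sgbm.compute(left, right).astype("float32") / 16.0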

2.1.2 Distance Estimation based on Depth Information

After estimating the disparity, we can determine the depth, which then allows for an easy calculation of distance. Figure 3 illustrates the stereo camera model used in this process.

Fig 2 Stereo disparity matching output of SGBM

The depth Z refers to the distance from a real-world point P to the camera. This concept is illustrated in a stereo vision system featuring two parallel cameras, designated as C and C'. The baseline, denoted as B, represents the distance between the cameras, while f indicates the focal length. Additionally, x and x' represent the image planes of cameras C and C', respectively.

By triangulation, we can compute the depth Z with the following formula, where (x - x') is the disparity:

Z = f * B / (x - x')

Depth and disparity are inversely related; as depth increases, disparity decreases, and vice versa. It is important to maintain consistency in the units of measurement used for distance.
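A small sketch of this conversion, assuming the focal length is expressed in pixels and the baseline in metres (the numbers below are placeholders, not the project's calibration values):

import numpy as np

def depth_from_disparity(disparity_px, focal_px=700.0, baseline_m=0.54):
    """Triangulation Z = f * B / d; invalid (non-positive) disparities map to infinity."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    depth_m = np.full_like(disparity_px, np.inf)
    valid = disparity_px > 0
    depth_m[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth_m   # metres; larger disparity -> smaller depth, as noted above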

Multi-tasking Detection Model

Multi-task learning aims to create networks that efficiently learn shared representations by utilizing multiple supervisory signals from various tasks, making it significantly more effective than using separate networks for individual tasks.

Developing a multitasking model for panoptic driving presents several challenges that need to be addressed to achieve effective and reliable performance. Some of these challenges include:

1) Computational Complexity: Building a multitasking model for panoptic driving involves integrating multiple tasks such as perception, prediction, planning, and control. This leads to increased computational complexity, as each task requires substantial processing power and resources. Efficiently managing the computational load and ensuring real-time performance across all tasks is a significant challenge.

2) Task Interdependencies: The different tasks in panoptic driving are interconnected and interdependent. Perception tasks, such as object detection and tracking, provide crucial inputs for prediction, planning, and control tasks. Ensuring smooth coordination and information flow between tasks is challenging, as errors or delays in one task can propagate and affect the performance of others.

3) Trade-off Between Accuracy and Efficiency: Balancing the accuracy of each task with computational efficiency is a challenge in multitasking models. While achieving high accuracy is desirable, it often comes at the cost of increased computational resources and processing time. Striking the right balance between accuracy and efficiency is crucial to ensure real-time performance while maintaining acceptable levels of task performance.

In this section, we present the BDD100K dataset, utilized for training and validating our Multitasking Detection Model. BDD100K, or Berkeley DeepDrive 100K, is a comprehensive and diverse driving video dataset that is extensively used in the fields of computer vision and autonomous driving research. Developed by researchers at the University of California, Berkeley, it features a vast array of labeled images and videos that encompass a wide range of driving scenarios.

The BDD100K dataset provides comprehensive insights into urban driving environments, featuring over 100,000 high-resolution images and accompanying video sequences. It captures a variety of weather conditions, lighting scenarios, and traffic patterns across multiple geographical locations, including major cities like San Francisco and New York. This diverse data collection enhances the dataset's applicability for research and development in autonomous driving technologies.

The BDD100K dataset features video clips sourced from densely populated areas across various cities and regions in the United States, with each dot on the map indicating the starting location of these clips.

The dataset offers detailed annotations for multiple tasks, including object detection, semantic segmentation, instance segmentation, and drivable area segmentation. It includes object detection annotations for common road elements like cars, pedestrians, cyclists, traffic signs, and traffic lights. Additionally, the semantic segmentation annotations deliver pixel-level labels for various classes, facilitating enhanced scene understanding and segmentation tasks, as demonstrated in Fig 5.

Fig 5 Overview of BDD100K dataset for multitasking purpose

BDD100K has established itself as a key dataset in computer vision and autonomous driving research, enabling the advancement and assessment of algorithms and models focused on perception, scene understanding, and decision-making in real-world driving situations. A detailed technical representation of the instances included is illustrated in Fig 6.

The instance statistics of the object categories reveal a long-tail distribution in the number of instances across each category. Notably, approximately 50% of these instances are occluded, while around 7% are truncated.

BDD100K plays a crucial role in advancing autonomous driving technologies, object detection algorithms, and scene understanding research. Its extensive and varied dataset is widely utilized by researchers, providing a vital resource for the development and evaluation of algorithms applicable to real-world driving scenarios.

To evaluate multitasking models for panoptic driving, several metrics can be considered to assess their performance across different tasks:

Intersection over Union (IoU) is a crucial metric for evaluating the overlap between two bounding boxes, derived from the Jaccard Index. Widely used in object detection tasks, IoU requires both a ground truth bounding box and a predicted bounding box to assess detection accuracy. This metric plays a vital role in distinguishing true positives from false positives, ensuring the reliability of detection results. The calculation of IoU involves dividing the area of overlap between the predicted and ground truth bounding boxes by the total area of their union.
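A minimal sketch of that computation for two axis-aligned boxes in (x1, y1, x2, y2) form (an assumed box convention, shown only to make the definition concrete):

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # width of the overlap region
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))   # height of the overlap region
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0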

• True Positive (TP): A correct detection where the IOU is equal to or greater than the specified threshold

• False Positive (FP): A false detection where the IOU is less than the specified threshold

• False Negative (FN): A ground truth object that was not detected by the model

• True Negative (TN): In object detection tasks, TN is not commonly used in metrics

The term "TN" refers to the total number of bounding boxes that should not be detected in an image. Due to the potential for numerous correctly undetected boxes, TN is rarely used in evaluation metrics.

A comprehensive and visual representation of TP, FP, FN, and TN is given in Fig 7.

• Precision: Precision measures the proportion of correctly detected objects out of all the objects that were predicted as positive. In the context of object detection, precision represents the accuracy of the algorithm in identifying true positives.

• Recall: Recall, often referred to as sensitivity or true positive rate, quantifies the percentage of accurately identified objects compared to all actual positive instances. This metric reflects the thoroughness and effectiveness of a detection algorithm in identifying relevant objects.

Formulas for Precision and Recall are shown in (3) and (4):

Precision = TP / (TP + FP)   (3)

Recall = TP / (TP + FN)   (4)

We adopt the two most popular metrics for object detection used in the well-known Pascal VOC Challenge [11]:

• Precision x Recall curve: The Precision x Recall curve is an essential tool for assessing the performance of object detectors across varying confidence thresholds, with individual curves generated for each object class. A high-performing object detector is defined by its ability to sustain elevated precision as recall rises, indicating that both precision and recall can remain robust with adjustments to the confidence threshold. This relationship is illustrated by the equations for precision and recall, where the sum of true positives and false negatives (TP + FN) remains constant; as recall increases due to more true positives, false negatives decrease. For precision to stay high, false positives must also decline, reflecting fewer model errors. Typically, the curve begins with high precision values that gradually diminish as recall increases.

• Average Precision (AP): It is calculated as the area under the curve (AUC) of the Precision x Recall curve. As AP curves are often zigzag curves, comparing different curves (different detectors) in the same plot is usually not an easy task. In practice, AP is the precision averaged across all recall values between 0 and 1, implemented in [11] as (5):

AP = ∫ p(r̃) dr̃, with recall r̃ running from 0 to 1   (5)

where p(r̃) is the measured precision at recall r̃.
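As an illustration only, here is one common way to compute precision, recall, and AP as the area under the precision-recall curve from a list of confidence-sorted detections; the all-point interpolation shown is an assumption (the Pascal VOC tooling also defines an 11-point variant):

import numpy as np

def precision_recall_ap(tp_flags, num_gt):
    """tp_flags: 1 for TP and 0 for FP, ordered by descending detection confidence."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp_cum = np.cumsum(tp_flags)
    fp_cum = np.cumsum(1.0 - tp_flags)
    recall = tp_cum / num_gt                      # TP / (TP + FN), equation (4)
    precision = tp_cum / (tp_cum + fp_cum)        # TP / (TP + FP), equation (3)

    # area under the precision-recall curve with a monotone precision envelope
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    ap = np.sum((r[1:] - r[:-1]) * p[1:])
    return precision, recall, ap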

Vehicle Tracking

This section presents the background of vehicle tracking. Given an initial measurement of the object state, the tracking module tries to assign an identity to each vehicle and maintain that identity across subsequent frames.

Vehicle tracking has evolved significantly with the adoption of the tracking-by-detection (TBD) paradigm, which utilizes advanced object detectors to improve tracking precision. TBD focuses on data association, matching detections across sequential frames. Two prominent TBD methods under investigation are motion-based tracking and feature-based tracking. This section explores the methodologies of these two approaches and explains the process of data association in detail.

Data association is vital in tracking-by-detection, as it establishes correspondences between detections and existing tracks, ensuring accurate object tracking over time. Effective techniques are necessary to address challenges like occlusions, appearance variations, and cluttered environments. Two prominent methods used are Intersection over Union (IoU) and the Hungarian matching algorithm. IoU measures the overlap between detections and tracks, while the Hungarian algorithm optimally addresses the assignment problem by minimizing the total cost of these associations. Together, these methods enhance the robustness and accuracy of tracking systems. This subsection delves into the detailed workings of IoU and the Hungarian matching algorithm within the realm of data association in tracking-by-detection.

In tracking-by-detection, Intersection over Union (IoU) serves as a crucial similarity metric to evaluate the overlap between detection and tracking bounding boxes. It is calculated by dividing the area of intersection by the area of union, with higher IoU scores indicating better alignment. During the data association phase, detections are matched to existing tracks based on their IoU scores, with the highest scoring detection being assigned to the corresponding track. In cases where multiple detections have similar IoU scores, a predefined threshold is applied to filter out less reliable associations, thereby enhancing the overall robustness of the tracking system.

In the context of object detection and tracking, we analyze N detected boxes (D) and M generated track boxes (T) to compute the Intersection over Union (IoU) distance. This process results in a cost matrix, denoted as mIoU(D, T) in equation (11), which captures the relationship between each detected box and all generated track boxes, where IoU corresponds to the IoU score between a chosen detected box D and a generated track box T.

The Hungarian algorithm is an optimization technique widely utilized for solving the assignment problem in data association, particularly in tracking-by-detection scenarios. This algorithm focuses on determining the optimal assignment of detections to tracks by constructing a cost matrix, where each pair of detection and track is assigned a cost value based on factors like Intersection over Union (IoU), distance, appearance similarity, and motion consistency. Lower costs signify better matches, and by employing the Hungarian algorithm, the data association process ensures optimal assignments that enhance tracking accuracy. One of the key benefits of the Hungarian algorithm is its ability to provide an optimal solution to the assignment problem while efficiently managing varying numbers of detections and tracks.
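A minimal sketch of this association step, assuming a cost of 1 - IoU and SciPy's Hungarian solver; the threshold value and the injected iou_fn overlap function are illustrative assumptions:

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(detections, tracks, iou_fn, iou_threshold=0.3):
    """Match detections to tracks; returns matched pairs plus unmatched indices."""
    cost = np.ones((len(detections), len(tracks)))
    for i, d in enumerate(detections):
        for j, t in enumerate(tracks):
            cost[i, j] = 1.0 - iou_fn(d, t)            # lower cost = larger overlap
    det_idx, trk_idx = linear_sum_assignment(cost)     # Hungarian algorithm
    matches = []
    unmatched_dets = set(range(len(detections)))
    unmatched_trks = set(range(len(tracks)))
    for i, j in zip(det_idx, trk_idx):
        if 1.0 - cost[i, j] >= iou_threshold:          # reject weak overlaps
            matches.append((i, j))
            unmatched_dets.discard(i)
            unmatched_trks.discard(j)
    return matches, sorted(unmatched_dets), sorted(unmatched_trks)

Any box-overlap function, such as the iou helper sketched in the evaluation metrics subsection above, can be passed as iou_fn.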

Motion-based tracking is a vital area of research with significant applications in surveillance, robotics, sports analysis, and human-computer interaction. In particular, vehicle tracking is essential for enhancing transportation systems, improving traffic management, and supporting autonomous driving. Accurate vehicle monitoring is crucial for road safety, optimizing traffic flow, and increasing efficiency. By effectively tracking vehicles, traffic managers can make better decisions about signal timing, lane management, and route planning, which helps reduce congestion. Furthermore, in autonomous driving, reliable vehicle tracking is key for understanding and predicting the behavior of other vehicles, ensuring safe navigation.

The Kalman filter is a recursive estimation technique that effectively combines a mathematical algorithm with a system model to estimate the state of a dynamic system amidst noise and uncertainty. It operates in two main phases: prediction and update. During the prediction phase, the filter uses the previous state estimate along with the system dynamics model to forecast the current state and its associated uncertainty. In the update phase, sensor measurements are integrated to enhance the state estimate by comparing the predicted state with actual data. This process involves calculating the Kalman gain, which balances the influence of both the predicted state and the measurement data, resulting in an updated state estimate that more accurately reflects the true system state.

The Kalman filter methodology estimates the state of a dynamic system through a systematic process that includes initialization, prediction, measurement update, and recursion. These steps are iteratively applied at each time step to continuously enhance the accuracy of the state estimate, as illustrated in Fig 9.

Fig 9 Schematic Description of the Kalman Filter Algorithm [20]

The Kalman filter starts by initializing the state estimate and covariance matrix, where the state estimate serves as the initial guess of the system's true state and the covariance matrix reflects the uncertainty of this estimate. In the prediction step, the filter forecasts the current state based on the previous estimate and the system dynamics model, which outlines how the system changes over time, typically represented by a linear or nonlinear function. By integrating the system dynamics, the Kalman filter advances the current state estimate in time while also estimating the uncertainty of this prediction.

The Kalman filter processes the predicted state estimate and covariance matrix, integrating sensor measurements to refine the state estimate. During the measurement update phase, it assesses the predicted state against actual data to calculate the Kalman gain, which weighs the influence of the prediction and measurements according to their uncertainties. This gain reflects the relative impact of both sources on the updated state estimate, ultimately leading to a more accurate representation of the system's true state by combining the predicted state with measurement information.

In vehicle tracking, the Kalman Filter utilizes an initial state from an object detector and user-defined uncertainty to predict the vehicle's subsequent state in the next frame. After making this prediction, it incorporates measurements from the object detector, following data association, to refine the state estimate. This update combines the predicted state with measurement information, considering their uncertainties, and is applied recursively at each time step. The updated state then serves as the prior estimate for the next prediction, enabling the Kalman Filter to continuously enhance the state estimate with new measurements, thus ensuring real-time tracking of the vehicle's position and velocity.
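To make the predict/update cycle concrete, here is a minimal constant-velocity Kalman filter over a state [x, y, vx, vy]; the motion model and noise magnitudes are assumptions for illustration, not the tracker's tuned values:

import numpy as np

class KalmanCV:
    """Constant-velocity Kalman filter over the state [x, y, vx, vy]."""
    def __init__(self, x0, y0, dt=0.1):
        self.x = np.array([x0, y0, 0.0, 0.0], dtype=float)        # state estimate
        self.P = np.eye(4) * 10.0                                  # state covariance (uncertainty)
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)             # system dynamics model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)             # we measure position only
        self.Q = np.eye(4) * 0.01                                  # process noise
        self.R = np.eye(2) * 1.0                                   # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x

    def update(self, z):
        y = np.asarray(z, dtype=float) - self.H @ self.x           # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)                   # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x

In a tracking loop, predict() would be called once per frame and update() only when data association provides a matched detection.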

Feature-based tracking is a widely used technique in computer vision that emphasizes the extraction of distinctive features from target objects to ensure reliable and precise tracking. This method goes beyond merely considering an object's appearance or movement; it utilizes discriminative features to create correspondences between consecutive frames, thereby preserving track identities. By identifying unique characteristics such as color, texture, or shape, feature-based tracking algorithms offer effective solutions for a range of tracking applications.

DeepSORT is a prominent feature-based tracking algorithm that enhances the traditional SORT (Simple Online and Realtime Tracking) method by integrating a deep appearance embedding network. This combination allows for improved tracking accuracy and robustness in various applications.

DeepSORT uses a convolutional neural network (CNN) to extract deep appearance features from object regions in video frames. These features facilitate the calculation of similarity scores between detections and existing tracks, enhancing data association and track maintenance. By incorporating appearance information alongside motion cues, DeepSORT significantly improves tracking performance. Recent advancements in tracking algorithms build upon the foundation laid by DeepSORT and employ appearance features as an addition for data association, while dependence on the Kalman Filter remains significant.

Tracking datasets are essential for developing and evaluating computer vision algorithms, especially in object tracking. They offer labeled image or video sequences with corresponding ground truth annotations, allowing researchers to train and test their algorithms effectively. By utilizing these datasets, researchers can benchmark their performance against standardized metrics and compare various approaches, driving advancements in object tracking technology. Among these resources, the KITTI dataset is particularly noteworthy for vehicle tracking, as it supports autonomous driving research by providing extensive sensor data from moving platforms and accurate ground truth annotations for vehicles.

Vehicle Behavior Analysis/Trajectory Prediction

This section addresses the challenges inherent in trajectory prediction and offers the background needed to understand contemporary trajectory prediction methods.

Predicting outcomes in the ADAS domain poses a complex challenge due to the following characteristics:

Interdependence among agents, such as vehicles and pedestrians, is essential in understanding their behavior within a shared environment. The actions of one agent can significantly impact the trajectories of nearby agents, highlighting the importance of considering the entire scene, including traffic regulations, for accurate trajectory prediction. Consequently, trajectory prediction becomes a joint optimization problem that involves all agents in the vicinity, ensuring a comprehensive approach to understanding their interactions.

Developing a deep-learning module for optimizing the trajectories of multiple agents demands significant computational resources. However, autonomous vehicles function in real-time, necessitating predictions at approximately 10 Hz. This creates stringent limitations on the runtime budget for the prediction module.

The prediction module in self-driving software relies on the accuracy of earlier perception and tracking modules, which can introduce accumulated errors. Consequently, the effectiveness of the prediction module is directly linked to the performance of these preceding components.

The dynamic nature of autonomous vehicles requires careful consideration of both the ego-vehicle and other agents in motion. The future trajectories of these agents are influenced by their own actions as well as the movements of the ego-vehicle. To effectively analyze temporal sensor data, it is essential to incorporate ego-vehicle motion compensation. Utilizing a Bird's Eye View (BEV) model can help simplify the complexities involved in this process.

Agents in an environment can display multi-modal behavior, indicating that their previous experiences can lead to various potential future paths. For instance, a pedestrian entering a crosswalk while the signal is red may choose different actions based on their past interactions and situational awareness: they may either continue walking or turn around. A comprehensive trajectory predictor needs to evaluate all possible trajectories for each event and assign likelihood scores accordingly.

According to [29], the prediction task can be divided into two sections:

This section addresses the classification task of predicting an agent's intention, which signifies the expected behavior or action, such as a vehicle being stopped, parked, or in motion. This task is generally approached as a supervised learning problem, requiring the annotation of potential intention classifications for the agent.

The trajectory prediction task focuses on forecasting a set of potential future locations, or way-points, for an agent over the next Tpred frames. This prediction considers the agent's interactions with other entities in the environment and the road, aiming to determine the likely path the agent will take. By analyzing the dynamic interactions and movements of the agent within a defined time frame, this approach provides valuable insights into future behavior.

Modeling trajectory predictions for dynamic road agents requires an understanding of both interaction-awareness and road awareness. Specifically, the trajectory of one vehicle, known as Veh.2, is influenced by the trajectory of another vehicle, referred to as Veh.3, highlighting the interdependent nature of their movements.

In trajectory prediction, understanding agent interactions and their intentions is crucial. For instance, an approaching car may decelerate if another vehicle attempts to merge onto a congested highway. Traditionally, trajectory prediction can be approached through either perspective-view or Bird's Eye View modeling.

The preference for Bird's Eye View (BEV) in modern applications stems from its ability to assign a dedicated distance range in a grid format for the Region of Interest (RoI), unlike image-view, which can theoretically have an infinite RoI due to perspective distortion. BEV simplifies the modeling of occlusions, allowing for a more linear representation of motion, and facilitates ego-motion compensation by effectively managing the translation and rotation of the ego-vehicle. Additionally, BEV maintains consistent motion and scale for vehicles, ensuring that they occupy the same number of pixels regardless of their distance from the ego-vehicle, a feature not found in image-view.

Accurate predictions of future events hinge on a solid understanding of past behaviors, which can be enhanced through tracker outputs or historically aggregated BEV features. A notable trend in trajectory prediction is goal-based prediction, which posits that comprehending an agent's specific goals is essential for anticipating its future actions. By integrating the goals of agents into trajectory prediction models, we can significantly enhance the accuracy of forecasts, yielding valuable insights into their subsequent behaviors.

For effective trajectory prediction, large-scale perception datasets such as nuScenes, Waymo Open Dataset, Lyft, and ArgoVerse are utilized, as they provide valuable sequential data. However, these datasets often lack annotations for agent intentions, which are crucial for accurate predictions. To address this gap, researchers turn to the LOKI dataset, specifically designed to capture agent intentions. In scenarios where only sequential unlabeled data is available, trajectory prediction can still be achieved using auto-annotations, which require a robust offline perception and tracking model to detect agents and establish temporal relationships between them.

Intention prediction is a classification task that effectively utilizes loss functions like Binary Cross Entropy and Focal Loss, which enhance the model's ability to accurately classify agents' intentions.
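As a small, hedged illustration of those two losses for a binary intention label (the gamma and alpha values below are common defaults, not settings taken from the thesis):

import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)                 # predicted probability of the positive class
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)
    pt = np.where(y == 1, p, 1.0 - p)              # probability assigned to the true class
    w = np.where(y == 1, alpha, 1.0 - alpha)       # class-balancing weight
    return -np.mean(w * (1.0 - pt) ** gamma * np.log(pt))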

To assess the effectiveness of intent prediction models, several key metrics can be utilized, including precision, recall, F1-score, and mean average precision. These evaluation metrics offer valuable insights into the model's accuracy in classifying various intentions and measuring the overall success of the intent prediction process.

The quantitative evaluation is similar to that of the Object Detection task in subsection 2.2.3, equations (3) and (4), where: TP: True Positive; FP: False Positive; FN: False Negative.

The main difference is that, in this case, the subject matter becomes the accuracy of the predicted intention.

MULTI-TASKING MODEL FOR PANOPTIC DRIVING PERCEPTION

System overview

This section gives a broad and complete view of our proposed framework. For brevity, it is not assigned to an isolated chapter, in order to increase reader accessibility.

As the objectives in section 1.2 suggest, we designed a unified framework that can give timely warnings to the driver; an overview of the complete framework is shown in Fig 15.

Fig 15 Overview of the complete framework: vehicle detection and scene recognition, vehicle tracking and behavior analysis with enhancing depth information

To address the limitations of previous methods, we developed a system that utilizes depth data from stereo cameras for vehicle detection, tracking, and scene recognition from an egocentric perspective. This system processes the tracked states of vehicles to predict their future positions. By leveraging depth data and detection results, it calculates the ego-distance to other vehicles, providing timely warnings to the driver if any predicted states or distances breach safety thresholds. The operational details of this framework are elaborated in the subsequent chapters.

• The remainder of this chapter is dedicated to the first main component: the Multitasking Detection Model;

• Chapter 4 clarifies our newly suggested Vehicle Tracking framework;

• Chapter 5 illustrates the Behavior Analysis and Warning mechanism.

Network Model Selection

To identify which behavior is abnormal or dangerous, our framework needs not only detections of surrounding vehicles for tracking purposes, but also traffic scene information in order to process, understand and analyze the situation properly, avoiding unnecessary warnings that negatively affect the driver's attention.

To meet this requirement, it is essential to implement a traffic scene understanding algorithm capable of simultaneously detecting vehicles and performing panoptic driving segmentation, as discussed in the Literature Review of Section 2.2.

YOLOPv2 is a standout model known for its innovative architecture and effective use of a shared encoder, resulting in exceptional performance, as demonstrated both statistically in TABLE I and visually in Fig 16 and Fig 17.

As a result, we employ YOLOPv2 for both perceptive tasks: vehicle detection and driving scene understanding.

TABLE I. PERFORMANCE COMPARISON AMONG PANOPTIC DRIVING PERCEPTION NETWORKS [14]

Network Size Speed (fps)↑ mAP50 ↑ Recall ↑ Drivable mIoU ↑ Lane IoU ↑

Fig 16 Day time perception result comparison

Fig 17 Night time perceptive result comparison

Network Architecture

The network architecture features a shared encoder that extracts features from the input image, complemented by three specialized decoder heads, each tailored for distinct tasks. This section outlines the network configuration, highlighting the arrangement and setup of the model's components.

Our network employs the E-ELAN design, distinguishing it from YOLOP, which uses CSPdarknet as its backbone. This innovative design leverages group convolution, enabling various layers to learn a broader spectrum of features, as depicted in Fig 19.

In the neck of the architecture, features from various stages are collected and fused through concatenation, incorporating the Spatial Pyramid Pooling (SPP) module for multi-scale feature fusion, similar to YOLOP. Furthermore, the Feature Pyramid Network (FPN) module is utilized to merge features with different semantic levels, enhancing the model's capability to capture and integrate information across varying levels of granularity and semantic complexity.

Fig 18 The network architecture of YOLOPv2

YOLOPv2 utilizes three distinct decoder heads tailored for specific tasks, mirroring the anchor-based multi-scale detection approach found in YOLOv7. The architecture begins with a Path Aggregation Network (PAN) that enhances localization feature extraction through its bottom-up design. By integrating PAN features with those from the Feature Pyramid Network (FPN), YOLOPv2 effectively merges semantic information with local features. This fused feature map is then used for multi-scale detection in the PAN module.

In the multi-scale feature map, each grid is associated with three anchors that have varying aspect ratios. The detection head is responsible for predicting position offsets, scaling height and width, and providing probabilities and confidence scores for each predicted class.

Drivable area segmentation and lane segmentation are approached as distinct tasks with unique network architectures. Unlike YOLOP, which derives features for both tasks from the last layer of the neck, YOLOPv2 employs various semantic feature levels. Research indicates that deeper network layers are unnecessary for drivable area segmentation and may impede model convergence during training. Consequently, the drivable area segmentation head is linked before the FPN module. To mitigate any potential performance loss from this adjustment, an additional up-sampling layer is introduced, utilizing four nearest interpolation up-sampling operations in the decoder stage.

In lane segmentation, the task branch is linked to the final layer of the Feature Pyramid Network (FPN) to capture deeper-level features, which is crucial for accurately detecting road lines that are often slender and difficult to identify in images. Furthermore, the application of deconvolution in the decoder stage enhances the performance of lane detection.

VEHICLE TRACKING

System Overview

We present an advanced multi-object tracking framework that prioritizes object appearance over traditional mathematical models for tracking. Building on the DiMP tracker, our method effectively analyzes diverse vehicle features like shapes, colors, and textures. This enables our framework to adeptly navigate complex real-world situations, including varying motion patterns and occlusions of on-road objects. Additionally, we enhance track management by revising SORT and implementing an adaptive thresholding technique tailored for feature-based tracking, significantly minimizing common errors such as out-of-frame handling and feature loss.

Our tracking framework, illustrated in Fig 20, consists of an Object Detector, a Feature-based Tracker, a Data Association mechanism, and a Track Management strategy. We emphasize the introduction of adaptive thresholding techniques tailored for enhancing feature-based tracking methods.

Feature-based Tracker

The main goal of the tracker is to determine the object's position using data from the previous frame. To enhance tracking accuracy, we adopt the DiMP-50 tracker concept [44] to build our feature-based tracker.

The tracker's workflow consists of two primary steps: Initialize and Track. In the following parts of this subsection, we delve into the methodology of these components, providing a detailed overview of their functioning. Moreover, an additional aspect of the tracker, which involves bounding box refinement using IoUNet prediction, will be described.

Fig 20 System Overview of Tracking Framework

The initialization step is crucial for establishing object identifiers upon their first detection, as illustrated in the accompanying figures. This process begins with an image containing a newly detected bounding box, which is processed through various data augmentation techniques, including translation, rotation, blur, and dropout, to generate a diverse set of augmented image samples that reflect potential variations in object appearance. Subsequently, a feature extractor, specifically ResNet-50, is employed to produce deep feature maps from these augmented samples. These feature maps are then paired with the corresponding bounding box centroids to create a set of training samples, denoted as \( S_{\text{train}} = \{(x_j, P_j)\}_{j=1}^{n} \). This training dataset is utilized to train the online object identifier \( f_i \), which comprises a single convolutional layer designed to assess the likelihood of an input window containing an object.

The object identifier \( f_i \) is designed as a compact network that can be trained online, so that it can continuously and rapidly adapt to changes in the object's appearance.

The model will undergo training for 10 epochs on the initial frame, followed by an additional 20 epochs for every 20 frames on the updated training dataset \( S_{\text{train}} \). The object identifier is optimized using a regularized loss function, as outlined in equation (20).

Fig 21 Initialize component of tracker.

where \( \theta \) denotes the weights of the object identifier \( f \) and \( \lambda \) is the regularization factor. The function \( a \) captures the difference, or error, between the predicted target confidence score and the ground-truth label.

In our approach, we define the difference between the predicted bounding box centroid \( f(x) \) and the ground-truth centroid \( P \) at each spatial location as \( a \). To tackle data imbalance and improve the model's focus on positive data, we employ a combination of least-squares regression and hinge loss for \( a \), following the methodology proposed in [44].
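Purely as a rough sketch of this idea (not the thesis's exact formulation), the snippet below trains a single convolutional layer online with an L2-regularised regression objective standing in for equation (20); the channel count, kernel size, optimiser, and learning rate are assumptions, and the hinge-based residual of [44] is simplified to a plain squared error:

import torch
import torch.nn as nn

class OnlineIdentifier(nn.Module):
    """Single convolutional layer producing a target confidence score map."""
    def __init__(self, channels=256, kernel=4):
        super().__init__()
        self.filter = nn.Conv2d(channels, 1, kernel_size=kernel, padding=kernel // 2)

    def forward(self, feat):                      # feat: (B, C, H, W) backbone feature maps
        return self.filter(feat)                  # (B, 1, H', W') confidence scores

def train_online(model, feats, targets, lam=1e-2, epochs=10, lr=0.1):
    """feats: augmented feature samples; targets: label maps centred on the annotated box."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        residual = model(feats) - targets                              # prediction error a
        reg = sum((lam * p).pow(2).sum() for p in model.parameters())  # ||lambda * theta||^2
        loss = residual.pow(2).mean() + reg                            # regularised squared error
        loss.backward()
        opt.step()
    return model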

The Track component plays a crucial role in estimating object states in subsequent video frames after initial information is established As illustrated in Fig 22, the system utilizes the bounding box from the previous frame (𝐵𝐵 𝑖𝑖 𝑔𝑔−1) to define the search area for the current frame (t), focusing on the most likely position of the object This approach enhances processing efficiency by extracting features only within the designated search area, allowing the tracker to concentrate on potential object locations Unlike previous methods that employed a fixed scaling factor (S c) for determining the search area, our approach adapts S c based on the object's relative position in the prior frame, as outlined in equation (21).

The adaptive search area scale (AS c) is calculated using the formula 𝑚𝑚𝑇𝑇𝑥𝑥 𝐷𝐷 𝑐𝑐 , where S c is the predefined default scaling factor The variables d x and d y represent the closest distances from the bounding box center in the previous frame to the edges of the input image Additionally, D max indicates the maximum distance from the bounding box center to the edge of the image This framework is essential for optimizing tracking accuracy in visual data analysis.

When the bounding box from the previous frame is near a vertical edge of the image, \( D_{max} \) is determined as half the image's width; otherwise, it is half the image's height. The adaptive search area scale dynamically modifies the search area as the object nears the image's edge, where partial disappearance changes the object's appearance. By narrowing the search area, the system emphasizes the most pertinent and reliable object features, allowing the tracker to better adapt to changes in appearance and minimize false outputs.
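
One plausible reading of equation (21), assuming the search-area scale shrinks in proportion to how close the box center lies to the nearest image edge, is:

\( AS_c = S_c \cdot \frac{\min(d_x, d_y)}{D_{max}} \)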

Utilizing the extracted features \( x_t \), the online object identifier \( f_i \) generates a confidence score map that indicates the likelihood of the object's presence throughout the search area. The location with the highest score is selected as the centroid of the object's bounding box for the current frame, signifying the most probable position of the object. By merging this centroid with the bounding box dimensions from the previous frame, a rough estimate of the object \( o_i \) in the current frame is obtained.

The tracking module effectively locates objects within the frame; however, its accuracy is limited because it relies on a rough estimate using only the previous bounding box size. To enhance this accuracy, the IoUNet prediction method is utilized to refine the target box. After establishing a coarse estimate of the object's location, ten bounding box proposals are generated by introducing uniform noise around this estimate. By leveraging features from the previous and current frames, IoUNet predicts the Intersection over Union (IoU) score for each proposal. The final bounding box estimate is derived by averaging the three proposals with the highest IoU scores.
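
The refinement step can be sketched as follows, assuming a 10-proposal, top-3 averaging scheme; predict_iou stands in for the learned IoUNet scorer, and the noise magnitude is an illustrative assumption rather than the exact value used in [45]:

import numpy as np

def refine_box(coarse_box, predict_iou, n_proposals=10, noise=0.1, top_k=3, rng=None):
    # Refine a coarse (cx, cy, w, h) box by scoring jittered proposals and
    # averaging the top-k candidates, as described above.
    rng = rng or np.random.default_rng(0)
    cx, cy, w, h = coarse_box
    proposals = np.stack([
        [cx + rng.uniform(-noise, noise) * w,
         cy + rng.uniform(-noise, noise) * h,
         w * (1 + rng.uniform(-noise, noise)),
         h * (1 + rng.uniform(-noise, noise))]
        for _ in range(n_proposals)
    ])
    scores = np.array([predict_iou(p) for p in proposals])
    best = proposals[np.argsort(scores)[-top_k:]]   # three highest-scoring proposals
    return best.mean(axis=0)                        # averaged final estimate

# Toy usage with a dummy scorer that prefers boxes close to a "true" box.
true_box = np.array([100.0, 80.0, 40.0, 30.0])
dummy_scorer = lambda p: -np.linalg.norm(p - true_box)
print(refine_box([105.0, 83.0, 38.0, 33.0], dummy_scorer))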

To keep the object identifier effective, new training samples are added to the \( S_{train} \) dataset when they meet a satisfactory confidence level. To manage memory size, the oldest samples in \( S_{train} \) are removed, allowing for the integration of new data. This process keeps the object identifier current and enhances its accuracy in object identification.
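
A minimal sketch of this memory management, assuming a fixed-size FIFO buffer and an illustrative confidence threshold (both specific values are assumptions):

from collections import deque

class SampleMemory:
    # Fixed-size FIFO memory for online-training samples.
    def __init__(self, max_size=50, conf_threshold=0.8):
        self.samples = deque(maxlen=max_size)   # oldest samples drop out automatically
        self.conf_threshold = conf_threshold
    def maybe_add(self, feature_map, centroid, confidence):
        # Only keep samples the identifier is reasonably sure about.
        if confidence >= self.conf_threshold:
            self.samples.append((feature_map, centroid))

memory = SampleMemory()
memory.maybe_add("feat_t0", (120, 45), confidence=0.93)   # stored
memory.maybe_add("feat_t1", (118, 47), confidence=0.40)   # ignored
print(len(memory.samples))  # -> 1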

Data Association and Track Management

The fundamental necessity of tracking-by-detection techniques lies in accurately linking detections to pre-existing targets. Specifically, the output generated by the tracker, which consists of target bounding boxes, must be matched with the bounding boxes identified by an object detector. To address this assignment challenge, we employ a methodology akin to that of SORT, which uses Intersection over Union (IoU) and the Hungarian matching algorithm, as shown in Fig 20 and described in detail in section 2.3.1.

The Hungarian Algorithm processes the cost matrix derived from the IoU values, culminating in three key outputs: matched tracks (\( T_m \)), unmatched tracks (\( T_u \)), and unmatched detections (\( D_u \)). Each matched track in \( T_m \) signifies a pairing of a detection (\( D \)) and a track (\( T \)) with the highest IoU score. Unmatched tracks (\( T_u \)) are those whose overlaps fall below a specified IoU threshold, while unmatched detections (\( D_u \)) pertain to newly detected objects that have not been paired with any track.
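
A minimal sketch of this association step, assuming boxes in (x1, y1, x2, y2) format and an illustrative IoU threshold of 0.3; scipy.optimize.linear_sum_assignment provides the Hungarian matching:

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # IoU of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    # Return matched (track_idx, det_idx) pairs, unmatched tracks, unmatched detections.
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))
    iou_matrix = np.array([[iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(-iou_matrix)   # maximise total IoU
    matches = [(r, c) for r, c in zip(rows, cols) if iou_matrix[r, c] >= iou_threshold]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets

# Toy example: one track overlapping one of two detections.
print(associate([(0, 0, 10, 10)], [(1, 1, 11, 11), (50, 50, 60, 60)]))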

Our framework utilizes a modified track handling approach inspired by SORT and DeepSORT, effectively managing the sets \( T_m \), \( T_u \), and \( D_u \). Furthermore, we introduce an adaptive thresholding method tailored to our system, which can also be applied to other feature-based tracking systems.

New tracks are initialized from the set of unmatched detections \( D_u \). Each element in \( D_u \) is assigned as a new track \( T \), which includes essential parameters such as the bounding box center position \( c \), aspect ratio \( \gamma \), height \( h \), and track management parameters such as track age \( a \), track hit count \( b \), and predefined search area scale \( S_c \). This process ensures effective track management in the detection pipeline.

After initialization, each tracking target \( T \) undergoes data association with detected boxes in the following frame, incrementing the counter \( b \) upon successful association. Conversely, an unsuccessful association increments the parameters \( a \) and \( S_c \) to account for potential object re-entries; both reset to their initial values if a track successfully rematches with a detected box. Tracks whose \( a \) value exceeds a maximum age threshold \( A_{max} \) are deemed to have permanently left the frame and are subsequently rejected. Unlike SORT and DeepSORT, which utilize a fixed age threshold, our approach introduces an adaptive thresholding method tailored to feature-based trackers to reduce the associated errors.

\( A_{max} \) is defined similarly to \( AS_c \) as described in (21). By employing the Adaptive Maximum Age mechanism, when a vehicle's trajectory starts to extend beyond the frame boundaries, \( A_{max} \) progressively decreases, allowing for the swift removal of such tracks. This adaptive strategy optimizes the tracker's ability to manage objects that exit the frame, thereby improving overall tracking efficiency.
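
A minimal sketch of the Adaptive Maximum Age idea, assuming a base age limit scaled linearly by the edge-distance ratio of (21) (the base value and the exact scaling are illustrative assumptions):

def adaptive_max_age(cx, cy, img_w, img_h, base_max_age=30):
    # Scale the maximum allowed track age by how close the last known box centre
    # is to the nearest image edge.
    d_x = min(cx, img_w - cx)            # closest horizontal distance to an edge
    d_y = min(cy, img_h - cy)            # closest vertical distance to an edge
    d_max = img_w / 2 if d_x < d_y else img_h / 2
    ratio = min(d_x, d_y) / d_max        # 1.0 at the centre, -> 0 near the edge
    return max(1, int(round(base_max_age * ratio)))

print(adaptive_max_age(cx=320, cy=180, img_w=640, img_h=360))  # centred -> full age (30)
print(adaptive_max_age(cx=10,  cy=180, img_w=640, img_h=360))  # near edge -> small age (1)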

In tracking applications, it is essential to account for scenarios where objects are visible only in the initial frames. As outlined in [21], tracks are deemed tentative during their first few frames and are only confirmed after a specified duration.

Confirmation requires a number of consecutive successful associations, and tentative tracks are eliminated immediately after any unsuccessful association. By enforcing the requirement for consecutive matches, this approach effectively reduces errors associated with false or transient detections.

VEHICLE BEHAVIOR ANALYSIS AND DRIVER WARNING

System Overview

Excessive and irrelevant alarms can distract drivers and create more harm than benefit. To mitigate this issue, our focus is on analyzing the ego-vehicle's lane, specifically monitoring the behaviors of vehicles that may pose a risk, such as those merging from adjacent lanes or decelerating within the same lane. An overview of the warning mechanism is illustrated in Fig 23.

Fig 23 Overview of the Behavior Analysis and Warning component

To meet the specified requirements, our warning mechanism must first extract the ego-lanes and ego-driving area, followed by inferring the future states of surrounding vehicles. The extraction of ego-lanes is achieved through Lane Processing, detailed in section 5.2, while the future-state inference is performed using a Kalman filter, as outlined in section 5.3. The comprehensive analysis and warning process is further elaborated in section 5.4.

Post Processing of Multitasking Detection Model output

The default Multitasking Detection Model analyzes the entire image frame, which is not ideal for our needs. To address this, we utilize the Hough transform and other Computer Vision algorithms to create a post-processing pipeline that extracts the ego-lane and ego-drivable area, as illustrated in Fig 24.

The detected lane lines and driving areas are represented as mask images consisting of binary pixels. To extract meaningful lines from these masked lane images, the Hough transform is applied, converting the mask into identifiable line segments for subsequent processing.

The Hough transform generates an excessive number of lines, which can be unmanageable. Therefore, it is essential to merge lines with similar slopes and intercepts to streamline the results.
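
A minimal sketch of this step, assuming OpenCV's probabilistic Hough transform and slope/intercept tolerances that are illustrative rather than the tuned values used in our pipeline:

import numpy as np
import cv2

def extract_and_merge_lines(lane_mask, slope_tol=0.1, intercept_tol=20.0):
    # Run the probabilistic Hough transform on a binary lane mask and merge
    # segments whose slope/intercept are close.
    segments = cv2.HoughLinesP(lane_mask, rho=1, theta=np.pi / 180, threshold=30,
                               minLineLength=20, maxLineGap=10)
    if segments is None:
        return []
    merged = []  # each entry: [slope, intercept, count]
    for x1, y1, x2, y2 in segments[:, 0]:
        if x2 == x1:                       # skip perfectly vertical segments for simplicity
            continue
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        for group in merged:
            if abs(group[0] - slope) < slope_tol and abs(group[1] - intercept) < intercept_tol:
                # Running average of the group's slope and intercept.
                group[0] = (group[0] * group[2] + slope) / (group[2] + 1)
                group[1] = (group[1] * group[2] + intercept) / (group[2] + 1)
                group[2] += 1
                break
        else:
            merged.append([slope, intercept, 1])
    return [(m[0], m[1]) for m in merged]

# Toy mask with a single diagonal line drawn on it.
mask = np.zeros((240, 320), dtype=np.uint8)
cv2.line(mask, (10, 230), (200, 40), 255, 2)
print(extract_and_merge_lines(mask))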

The post-processing pipeline identifies separate road-lane lines while filtering out horizontal lines; the horizontal lines are later analyzed to determine whether the scene is at an intersection, based on the ratio of horizontal to vertical lines exceeding a preset threshold. To focus solely on ego-lane lines, the distance between each line and the ego-position is calculated, and the two lines with the smallest distance are selected as the ego-lane lines.

In typical driving scenarios, lane lines are often only partially visible. To address this issue and complete the missing sections, we employ a linear line-fitting algorithm to extrapolate the existing line segments toward the lower edge of the image frame and toward the frame's centroid, which is close to the vanishing point.
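
A minimal sketch of the extrapolation, assuming a first-order fit of x over y (fitting x as a function of y keeps near-vertical lane lines well-conditioned; coordinates are illustrative):

import numpy as np

def extrapolate_lane(points, img_h, vanishing_y):
    # Fit a straight line to detected lane points and extend it from the bottom
    # edge of the frame up to the vanishing-point row.
    ys = np.array([p[1] for p in points], dtype=float)
    xs = np.array([p[0] for p in points], dtype=float)
    slope, intercept = np.polyfit(ys, xs, deg=1)          # x = slope * y + intercept
    y_bottom, y_top = float(img_h - 1), float(vanishing_y)
    return ((slope * y_bottom + intercept, y_bottom),
            (slope * y_top + intercept, y_top))

# Toy example: a short visible segment extended over the full lane extent.
segment = [(150, 200), (170, 230), (190, 260)]
print(extrapolate_lane(segment, img_h=360, vanishing_y=120))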

5.2.4 Establish Warning Region of Interest

The Warning Region of Interest (WROI) defines the ego-drivable area by combining the ego-lane lines with the drivable area generated by the multitasking detection model. This WROI acts as the focal point of our warning mechanism, which assesses the behavior of surrounding vehicles to determine whether their predicted states intersect with this area. Details on how these predicted states are obtained are provided in section 5.3.

Predict Future State using Kalman Filter

The Kalman Filter plays a crucial role in managing measurement noise and uncertainty, as discussed in subsection 2.3.3. By utilizing only the prediction step of this technique, we can effectively predict the future states of objects, providing valuable insight into their behavior. This process yields predicted information about nearby vehicles, with the estimated state represented as their relative position within the image frame, as demonstrated in the lower processing branch of Fig 23.
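
A minimal sketch of the prediction-only step, assuming a constant-velocity state [cx, cy, vx, vy] in image coordinates and an illustrative process-noise magnitude:

import numpy as np

def kalman_predict(x, P, dt=1.0, q=1e-2):
    # One prediction step of a constant-velocity Kalman filter; only the predict
    # stage is used here, matching the description above.
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)      # constant-velocity transition
    Q = q * np.eye(4)                              # process noise covariance
    x_pred = F @ x                                 # predicted state
    P_pred = F @ P @ F.T + Q                       # predicted covariance
    return x_pred, P_pred

# A vehicle at (300, 200) px moving 5 px/frame right and 2 px/frame down.
state = np.array([300.0, 200.0, 5.0, 2.0])
cov = np.eye(4)
next_state, next_cov = kalman_predict(state, cov)
print(next_state[:2])   # predicted centre for the next frame -> [305. 202.]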

A visual explanation of our predicting method is given in Fig 25

Fig 25 Applying Kalman Filter prediction to determine the next positions of tracked vehicles. Estimated positions are shown in white bounding boxes, while current states are shown in yellow ones

The Kalman prediction method demonstrates its effectiveness, as evidenced by the nearly identical predicted and actual positions of vehicle ID 93 (the white vehicle). In contrast, vehicle ID 1 displays only a single white bounding box because its predicted state overlaps its current state, although two boxes are in fact produced, as for vehicle 93. The occlusion of the yellow box by the white one indicates that vehicle ID 1's relative position in the image frame remains unchanged, a fact confirmed by measurements taken over three consecutive frames.

Analyzing and Warning

Revisiting the overview of the entire framework (Fig 15) and the overview of the Behavior Analysis and Warning component (Fig 23) explains our analysis process. Our inputs are:

1) Tracked vehicles and their estimated distances from the tracking module;

2) Ego vehicle driving lane from lane post-processing;

3) Predicted positions of other vehicles

We then evaluate traffic scenes; situations that receive warning marks include:

• Case 1: Distance decreases over a given number of consecutive frames – Decelerating;

• Case 2: Predicted position intersects ego vehicle driving lane – Lane-changing;

• Case 3: Ego vehicle enters intersections

When abnormal car behavior is detected, the system highlights it in red, alerting the driver to exercise greater caution. This driver assistance feature significantly enhances safety and has the potential to reduce accident rates. The subsequent sections outline the principles behind three specific cases, with visual results of the warning mechanism presented in section 6.2.

Fig 26 is an illustration of the deceleration warning.

In Fig 26, the upper images present the driver's view, while the lower images illustrate the processing involved in analysis and explanation. Key conventions include the green area representing the ego-vehicle's driving zone, the brown area indicating other driving zones, and the blue area denoting road lanes. Vehicles marked with a warning flag are highlighted with red masking. These conventions are applied consistently throughout sections 5.4 and 6.2.

Fig 26 The vehicle with decelerating action in the ego-lane is marked as warning

The ego-vehicle lane information is essential for identifying vehicles within the ego-lane, whose distances are continuously monitored across consecutive frames. If the estimated distance decreases over time while a vehicle remains in the ego-lane, a warning is triggered. Additionally, the system reports the estimated distance during such warnings (see Fig 26).
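
A minimal sketch of this rule, assuming a three-frame window of monotonically decreasing distance (the window length is an illustrative assumption):

def deceleration_warning(distance_history, window=3):
    # Flag a vehicle in the ego-lane as decelerating when its estimated distance
    # has decreased over `window` consecutive frames.
    if len(distance_history) < window + 1:
        return False
    recent = distance_history[-(window + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))

# Estimated distances (metres) to a lead vehicle over consecutive frames.
print(deceleration_warning([25.0, 24.1, 23.0, 21.8]))   # True  -> warn
print(deceleration_warning([25.0, 24.1, 24.5, 25.2]))   # False -> warning lifted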

Using the ego-lane and vehicle detection data, we can identify vehicles that currently remain outside the ego-lane. During lane changes from adjacent lanes to the ego-lane, vehicles typically move toward the ego-lane, producing predicted positions that encroach upon this area. If a vehicle in an adjacent lane is predicted to intersect the ego-lane, it is classified as executing a lane change, prompting a warning for that vehicle. Fig 27 illustrates the warnings issued to other vehicles during their lane-changing maneuvers.
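
A minimal sketch of the intersection test, assuming the WROI is available as a polygon in image coordinates and using OpenCV's point-in-polygon test on the predicted centroid:

import numpy as np
import cv2

def lane_change_warning(predicted_centroid, wroi_polygon):
    # Warn when a vehicle's predicted centroid falls inside the Warning Region of
    # Interest (the ego-lane polygon).
    contour = np.array(wroi_polygon, dtype=np.float32).reshape(-1, 1, 2)
    # pointPolygonTest returns > 0 inside, 0 on the edge, < 0 outside.
    return cv2.pointPolygonTest(contour, tuple(map(float, predicted_centroid)), False) >= 0

# Ego-lane region approximated as a trapezoid in image coordinates.
wroi = [(220, 360), (420, 360), (360, 200), (280, 200)]
print(lane_change_warning((300, 300), wroi))   # True  -> predicted position intersects ego-lane
print(lane_change_warning((100, 300), wroi))   # False -> stays in adjacent lane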

Fig 27 The vehicle performing lane-changing to ego-lane is marked as warning

Enhancing driver attention at intersections is crucial for safety, as drivers must remain vigilant of surrounding vehicles. Therefore, it is essential for the system to alert the driver to all vehicles present in any drivable area.

Scene recognition at intersections is accomplished by analyzing the appearance of lane lines and calculating the ratio of horizontal to vertical lines. Extensive testing demonstrates that our method generalizes effectively across various intersection scenarios.

Fig 28 At intersections, all the vehicles in any drivable area are marked as warning

As shown in Fig 28, our system generates warnings on multiple vehicles that are potentially involved in the driving area. This alerting mechanism helps the driver stay focused on intersection maneuvers.

RESULTS

Vehicle Tracking

For comparison, we also reproduced two popular motion-based tracking methods that employ association strategies similar to our proposed method, namely DeepSORT [21] and SORT [22].

TABLE II. EVALUATION RESULTS OF THREE METHODS ON MOTA METRIC

TABLE III. EVALUATION RESULTS OF THREE METHODS ON IDF1 METRIC

Our evaluation results, presented in TABLE II and TABLE III, demonstrate significant improvements in tracking performance using our method, as reflected in the overall MOTA score. Specifically, our approach achieved the highest rankings in both false negatives (FN) and false positives (FP), outperforming the second-best method, SORT, with a minor reduction in FN.

The proposed method significantly improved performance, reducing FP from 775 to 465, and excelled in key parameters such as the Mostly Tracked ratio (MT), Mostly Lost ratio (ML), and tracking precision (MOTP). Additionally, it demonstrated superiority at the association level, achieving a higher IDF1 score thanks to a notably lower IDFP value compared with the other methods.

TABLE IV. EVALUATION RESULTS OF THREE METHODS ON HOTA METRIC

In conclusion, the enhancements discussed validate the accuracy and reliability of our approach. The results presented in TABLE IV, particularly the HOTA metrics, demonstrate that our performance in both detection (DetA) and association (AssA) surpasses that of SORT and DeepSORT.

(Column metrics reported in TABLE II: MOTA↑, MOTP↑, MT↑, ML↓, FN↓, FP↓, IDSW↓, Frag↓.)

Behavior Analysis and Warning

The provision of warnings to drivers is not covered by the metrics outlined in subsection 2.4.4. Additionally, our system currently lacks a classification head, making quantitative evaluation impossible at this stage. To aid understanding of our warning mechanism's output, we present its visual performance in Fig 29 to Fig 32.

Our results on deceleration warnings are illustrated in Fig 29, where a continuous decrease in distance to the target vehicle is observed by the third frame, indicating its deceleration and triggering a warning. The warning is lifted by the sixth frame, when the distance begins to increase again. A similar pattern is shown in Fig 30.

Fig 29 A tracked vehicle is marked as warning due to its deceleration based on distance data

Fig 30 Additional results of behavior analysis and warning

Case 2: Warning of Lane-changing

Fig 31 illustrates the effectiveness of our warning mechanism for vehicle lane-changing over a sequence of 12 frames. The warning is activated in the third frame, when the system detects that the predicted position of the target vehicle encroaches upon the ego-lane. This warning persists for 8 frames as the target vehicle continues to approach the ego-lane. In the final two frames, the warning is deactivated, indicating that the target vehicle's predicted position no longer intersects the ego-lane.

Fig 31 A vehicle whose predicted position intersects ego-lane area is being warned

A sample result of the warning system at an intersection is shown in Fig 32.

Fig 32 At intersection, any vehicle inside drivable area is warned

CONCLUSION AND FUTURE WORK

Upon completion, the main work of this project is summarized as follows:

• Data from multiple imaging sensors are combined to produce depth information for distance estimation; the estimated distance is then employed to analyze target behavior;

• Successful integration of a fine-tuned Panoptic Driving Multitasking Model for vehicle detection and scene recognition tasks; the model’s output is further processed to better suit the project requirements;

• A newly designed, reliable tracking framework for vehicle tracking, with applications in Advanced Driver Assistance Systems (ADAS), Autonomous Vision, and Vehicle-to-Infrastructure (V2I) technologies; extensive evaluation on the established KITTI benchmark demonstrates that our proposed method outperforms existing approaches in tracking accuracy and robustness;

• A Behavior Analysis System that studies vehicle behavior and gives meaningful warnings, increasing driver attention during complex driving scenarios;

• Introduction of a comprehensive framework that integrates vehicle detection and scene recognition, vehicle tracking, and behavior analysis and warning to enhance road safety and provide intelligent driver assistance

Despite achieving the target objectives, this project retains some key limitations to be addressed in future work:

• Improvement of state prediction for Tracking Framework;

• Introduction of scene tracking for stable scene recognition;

• Solutions for real-time applications and hardware constraints;

• More robust processing in accident prediction and warning

REFERENCES

[1] M. Shawky, "Factors affecting lane change crashes," IATSS Research, vol. 44, no. 2, pp. 155–161, 2020, ISSN 0386-1112, doi: 10.1016/j.iatssr.2019.12.002.
[2] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," Int. J. Rob. Res., vol. 32, no. 11, pp. 1231–1237, 2013.
[3] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell, "BDD100K: A diverse driving dataset for heterogeneous multitask learning," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2636–2645, 2020.
[4] C. Premebida, G. Monteiro, U. Nunes, and P. Peixoto, "A lidar and vision-based approach for pedestrian and vehicle detection and tracking," IEEE Intelligent Transportation Systems Conference, 2007, pp. 1044–1049.
[5] R. O. Chavez-Garcia and O. Aycard, "Multiple sensor fusion and classification for moving object detection and tracking," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 2, pp. 525–534, 2016.
[6] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.
[7] H. Cho, Y.-W. Seo, B. V. Kumar, and R. R. Rajkumar, "A multisensory fusion system for moving object detection and tracking in urban driving environments," IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 1836–1843.
[8] F. Garcia, D. Martin, A. de la Escalera, and J. M. Armingol, "Sensor fusion methodology for vehicle detection," IEEE Intelligent Transportation Systems Magazine, vol. 9, no. 1, pp. 123–133, 2017.
[9] "Stereo disparity using semi-global block matching," MATLAB & Simulink – MathWorks, [Online]. Available: https://ww2.mathworks.cn/help/visionhdl/ug/stereoscopic-disparity.html (accessed Jun. 30, 2023).
[10] "cv::StereoSGBM class reference," OpenCV Documentation, [Online]. Available: https://docs.opencv.org/3.4/d2/d85/classcv_1_1StereoSGBM.html (accessed Jun. 30, 2023).
[11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes (VOC) Challenge," Int. J. Comput. Vision, vol. 88, no. 2, pp. 303–338, Jun. 2010, doi: 10.1007/s11263-009-0275-4.
[12] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, pp. 2980–.
[13] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[14] W. Zhou, Y. Zhu, J. Lei, R. Yang, and L. Yu, "LSNet: Lightweight spatial boosting network for detecting salient objects in RGB-thermal images," IEEE Transactions on Image Processing, vol. 32, pp. 1329–1340, 2023, doi: 10.1109/TIP.2023.3242775.
[15] M. Teichmann, M. Weber, M. Zöllner, R. Cipolla, and R. Urtasun, "MultiNet: Real-time joint semantic reasoning for autonomous driving," 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 2018, pp. 1013–1020, doi: 10.1109/IVS.2018.8500504.
[16] S. D. Yashwanth, S. V. Rao, Rakshit, Y. P. Meharwade, and R. Kivade, "Autonomous driving using YOLOP," 2022 IEEE North Karnataka Subsection Flagship International Conference (NKCon), Vijaypur, India, 2022, pp. 1–6.
[17] D. Vu, B. Ngo, and H. Phan, "HybridNets: End-to-end perception network," arXiv preprint arXiv:2203.09035, 2022.
[18] C. Han, Q. Zhao, S. Zhang, Y. Chen, Z. Zhang, and J. Yuan, "YOLOPv2: Better, faster, stronger for panoptic driving perception," arXiv preprint arXiv:2208.11434, 2022.
[19] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1–2, pp. 83–97, 1955.
[20] R. E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960, doi: 10.1115/1.3662552.
[21] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," arXiv [cs.CV], 2017.
[22] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and realtime tracking," arXiv [cs.CV], 2016.
[23] K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: The CLEAR MOT metrics," EURASIP J. Image Video Process., vol. 2008, 2008.
[24] J. Luiten, A. Ošep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe, "HOTA: A higher order metric for evaluating multi-object tracking," Int. J. Comput. Vision.
[25] E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," arXiv [cs.CV], 2016.
[26] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang, "ByteTrack: Multi-object tracking by associating every detection box," arXiv [cs.CV], 2021.
[27] J. Cao, J. Pang, X. Weng, R. Khirodkar, and K. Kitani, "Observation-centric SORT: Rethinking SORT for robust multi-object tracking," arXiv [cs.CV], 2022.
[28] Y. Du, Z. Zhao, Y. Song, Y. Zhao, F. Su, T. Gong, and H. Meng, "StrongSORT: Make DeepSORT great again," arXiv [cs.CV], 2022.
[29] I. Teeti, S. Khan, A. Shahbaz, A. Bradley, and F. Cuzzolin, "Vision-based intention and trajectory prediction in autonomous vehicles: A survey," Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22), International Joint Conferences on Artificial Intelligence Organization, Survey Track, 2022.
[30] Y. Yoon, T. Kim, H. Lee, and J. Park, "Road-aware trajectory prediction for autonomous driving on highways," Sensors, vol. 20, no. 17, 2020.
[31] S. V. Albrecht, C. Brewitt, J. Wilhelm, B. Gyevnar, F. Eiras, M. Dobre, and S. Ramamoorthy, "Interpretable goal-based prediction and planning for autonomous driving," 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2021, pp. 1043–1049.
[32] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11621–.
[33] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., "Scalability in perception for autonomous driving: Waymo Open Dataset," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
[34] J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, L. Chen, A. Jain, S. Omari, V. Iglovikov, and P. Ondruska, "One thousand and one hours: Self-driving motion prediction dataset," Conference on Robot Learning, 2020.
[35] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al., "Argoverse: 3D tracking and forecasting with rich maps," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8748–.
[36] H. Girase, H. Gang, S. Malla, J. Li, A. Kanehara, K. Mangalam, and C. Choi, "LOKI: Long term and key intentions for trajectory prediction," Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[37] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[38] S. Verma et al., "Vehicle detection, tracking and behavior analysis in urban driving environments using road context," 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 2018, pp. 1413–.
[39] Y. Yao, M. Xu, C. Choi, D. J. Crandall, E. M. Atkins, and B. Dariush, "Egocentric vision-based future vehicle localization for intelligent driving assistance systems," 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 9711–9717.
[40] Y. Yao, M. Xu, Y. Wang, D. J. Crandall, and E. M. Atkins, "Unsupervised traffic accident detection in first-person videos," 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 2019, pp. 273–.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[42] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[43] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
[44] G. Bhat, M. Danelljan, L. Van Gool, and R. Timofte, "Learning discriminative model prediction for tracking," arXiv [cs.CV], 2019.
[45] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, "ATOM: Accurate tracking by overlap maximization," arXiv [cs.CV], 2018.
[46] "Hough Line Transform," OpenCV Documentation, [Online]. Available: https://docs.opencv.org/3.4/d9/db0/tutorial_hough_lines.html (accessed Jul. 1, 2023).

[23] K Bernardin and R Stiefelhagen, “Evaluating multiple object tracking performance:

The CLEAR MOT metrics,” EURASIP J Image Video Process., vol 2008, pp 2008

[24] J Luiten, A Os̆ep, P Dendorfer, P Torr, A Geiger, L Leal-Taixé and B Leibe,

“HOTA: A higher order metric for evaluating multi-object tracking,” Int J Comput

[25] E Ristani, F Solera, R S Zou, R Cucchiara, and C Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” arXiv [cs.CV],

[26] Y Zhang, P Sun, Y Jiang, D Yu, F Weng, Z Yuan, P Luo, W Liu and X Wang

, “ByteTrack: Multi-object tracking by associating every detection box,” arXiv

[27] J Cao, J Pang, X Weng, R Khirodkar, and K Kitani, “Observation-centric SORT:

Rethinking SORT for robust multi-object tracking,” arXiv [cs.CV], 2022

[28] Y Du, Z Zhao, Y Song, Y Zhao, F Su, T Gong and H Meng, “StrongSORT:

Make DeepSORT Great Again,” arXiv [cs.CV], 2022

[29] Izzeddin Teeti, Salman Khan, Ajmal Shahbaz, Andrew Bradley, and Fabio Cuzzolin,

This survey discusses vision-based intention and trajectory prediction in autonomous vehicles, highlighting advancements presented at the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22) It emphasizes the importance of accurately predicting driver intentions and vehicle trajectories to enhance safety and efficiency in autonomous driving systems The findings presented by Lud De Raedt in 2022 contribute to the ongoing research in artificial intelligence applications within the automotive industry, showcasing innovative methodologies and their implications for future vehicle technologies.

Conferences on Artificial Intelligence Organization, Survey Track

[30] Yookhyun Yoon, Taeyeon Kim, Ho Lee, and Jahnghyon Park, “Road-aware trajectory prediction for autonomous driving on highways,” Sensors, vol 20, no 17,

[31] Stefano V Albrecht, Cillian Brewitt, John Wilhelm, Balint Gyevnar, Francisco Eiras,

Mihai Dobre, and Subramanian Ramamoorthy, “Interpretable goal-based prediction and planning for autonomous driving,” in 2021 IEEE International Conference on

Robotics and Automation (ICRA) IEEE, 2021, pp 1043–1049

[32] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong,

Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom,

“nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the

IEEE/CVF conference on computer vision and pattern recognition, 2020, pp 11621–

[33] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai

Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al.,

“Scalability in perception for autonomous driving: Waymo open dataset,”

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp 2446–2454

[34] John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Long Chen, Ashesh Jain,

Sammy Omari, Vladimir Iglovikov, and Peter Ondruska, “One thousand and one hours: Self-driving motion prediction dataset,” Conference on Robot Learning

[35] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak,

Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al.,

“Argoverse: 3d tracking and forecasting with rich maps,” Proceedings of the

IEEE/CVF conference on computer vision and pattern recognition, 2019, pp 8748–

[36] Harshayu Girase, Haiming Gang, Srikanth Malla, Jiachen Li, Akira Kanehara,

Karttikeya Mangalam, and Chiho Choi, “Loki: Long term and key intentions for trajectory prediction,” Proceedings of the IEEE/CVF International Conference on

[37] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar, “Focal loss for dense object detection,” Proceedings of the IEEE international conference on computer vision, 2017, pp 2980–2988

[38] S Verma et al., "Vehicle Detection, Tracking and Behavior Analysis in Urban

Driving Environments Using Road Context," 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 2018, pp 1413-

[39] Yao, Yu, Mingze Xu, Chiho Choi, David J Crandall, Ella M Atkins, and Behzad

Dariush "Egocentric vision-based future vehicle localization for intelligent driving assistance systems," 2019 International Conference on Robotics and Automation (ICRA), pp 9711-9717 IEEE, 2019

[40] Y Yao, M Xu, Y Wang, D J Crandall and E M Atkins, "Unsupervised Traffic

Accident Detection in First-Person Videos," 2019 IEEE/RSJ International

Conference on Intelligent Robots and Systems (IROS), Macau, China, 2019, pp 273-

[41] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE transactions on pattern analysis and machine intelligence 37, no 9 (2015): 1904-1916

[42] Lin, Tsung-Yi, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and

Serge Belongie, “Feature pyramid networks for object detection,” Proceedings of the

IEEE conference on computer vision and pattern recognition, pp 2117-2125 2017

[43] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia, “Path aggregation network for instance segmentation,” Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8759–8768, 2018

[44] G Bhat, M Danelljan, L Van Gool, and R Timofte, “Learning discriminative model prediction for tracking,” arXiv [cs.CV], 2019

[45] M Danelljan, G Bhat, F S Khan, and M Felsberg, “ATOM: Accurate tracking by overlap maximization,” arXiv [cs.CV], 2018

[46] “Hough Line transform,” OpenCV Documentation, [Online] [Available]: https://docs.opencv.org/3.4/d9/db0/tutorial_hough_lines.html (accessed Jul 1,

Ngày đăng: 14/11/2023, 10:11

Nguồn tham khảo

Tài liệu tham khảo Loại Chi tiết
[1] Mohamed Shawky, “Factors affecting lane change crashes”, IATSS Research, Volume 44, Issue 2, 2020, Pages 155-161, ISSN 0386-1112, DOI:10.1016/j.iatssr.2019.12.002 Sách, tạp chí
Tiêu đề: Factors affecting lane change crashes”, "IATSS Research
[2] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” Int. J. Rob. Res., vol. 32, no. 11, pp. 1231–1237, 2013 Sách, tạp chí
Tiêu đề: Vision meets robotics: The KITTI dataset,” "Int. J. Rob. Res
[3] Yu, Fisher, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. “BDD100k: A diverse driving dataset for heterogeneous multitask learning,” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2636-2645. 2020 Sách, tạp chí
Tiêu đề: BDD100k: A diverse driving dataset for heterogeneous multitask learning,” "Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
[4] C. Premebida, G. Monteiro, U. Nunes, and P. Peixoto, “A lidar and vision-based approach for pedestrian and vehicle detection and tracking,” IEEE Intelligent Transportation Systems Conference, 2007, pp. 1044–1049 Sách, tạp chí
Tiêu đề: A lidar and vision-based approach for pedestrian and vehicle detection and tracking,” "IEEE Intelligent Transportation Systems Conference
[5] R. O. Chavez-Garcia and O. Aycard, “Multiple sensor fusion and classification for moving object detection and tracking,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 2, pp. 525–534, 2016 Sách, tạp chí
Tiêu đề: Multiple sensor fusion and classification for moving object detection and tracking,” "IEEE Transactions on Intelligent Transportation Systems
[6] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 4, pp. 743–761, 2012 Sách, tạp chí
Tiêu đề: Pedestrian detection: An evaluation of the state of the art,” "IEEE transactions on pattern analysis and machine intelligence
[7] H. Cho, Y.-W. Seo, B. V. Kumar, and R. R. Rajkumar, “A multisensory fusion system for moving object detection and tracking in urban driving environments,”IEEE International Conference on Robotics and Automation (ICRA), 2014, pp.1836–1843 Sách, tạp chí
Tiêu đề: A multisensory fusion system for moving object detection and tracking in urban driving environments,” "IEEE International Conference on Robotics and Automation (ICRA)
[8] F. Garcia, D. Martin, A. de la Escalera, and J. M. Armingol, “Sensor fusion methodology for vehicle detection,” IEEE Intelligent Transportation Systems Magazine, vol. 9, no. 1, pp. 123–133, 2017 Sách, tạp chí
Tiêu đề: Sensor fusion methodology for vehicle detection,” "IEEE Intelligent Transportation Systems Magazine
[9] “Stereo disparity using semi-global block matching,” Stereo Disparity Using Semi- Global Block Matching - MATLAB & Simulink - MathWorks, [Online].Available: https://ww2.mathworks.cn/help/visionhdl/ug/stereoscopic-disparity.html (accessed Jun. 30, 2023) Sách, tạp chí
Tiêu đề: “Stereo disparity using semi-global block matching,”
[10] “CV::STEREOSGBM class reference,” OpenCV Documentation, [Online]. Available: https://docs.opencv.org/3.4/d2/d85/classcv_1_1StereoSGBM.html (accessed Jun. 30, 2023) Sách, tạp chí
Tiêu đề: “CV::STEREOSGBM class reference,”
[11] Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman, “The Pascal Visual Object Classes (VOC) Challenge,” Int. J. Comput.Vision 88, 2 (June 2010), 303–338. https://doi.org/10.1007/s11263-009-0275-4 [12] K. He, G. Gkioxari, P. Dollár and R. Girshick, "Mask R-CNN," 2017 IEEE Sách, tạp chí
Tiêu đề: The Pascal Visual Object Classes (VOC) Challenge,” Int. J. Comput. Vision 88, 2 (June 2010), 303–338. https://doi.org/10.1007/s11263-009-0275-4 [12] K. He, G. Gkioxari, P. Dollár and R. Girshick, "Mask R-CNN
[14] W. Zhou, Y. Zhu, J. Lei, R. Yang and L. Yu, "LSNet: Lightweight Spatial Boosting Network for Detecting Salient Objects in RGB-Thermal Images," IEEE Transactions on Image Processing, vol. 32, pp. 1329-1340, 2023, doi: 10.1109/TIP.2023.3242775 Sách, tạp chí
Tiêu đề: LSNet: Lightweight Spatial Boosting Network for Detecting Salient Objects in RGB-Thermal Images
[15] M. Teichmann, M. Weber, M. Zửllner, R. Cipolla and R. Urtasun, "MultiNet: Real- time Joint Semantic Reasoning for Autonomous Driving," 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 2018, pp. 1013-1020, doi:10.1109/IVS.2018.8500504 Sách, tạp chí
Tiêu đề: MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving
[16] S. D. Yashwanth, S. V. Rao, Rakshit, Y. P. Meharwade and R. Kivade, "Autonomous Driving Using YOLOP," 2022 IEEE North Karnataka Subsection Flagship International Conference (NKCon), Vijaypur, India, 2022, pp. 1-6, doi Sách, tạp chí
Tiêu đề: Autonomous Driving Using YOLOP
[17] Dat Vu, Bao Ngo, and Hung Phan, “Hybridnets: End-to-end perception network,” arXiv preprint arXiv: 2203.09035, 2022 Sách, tạp chí
Tiêu đề: Hybridnets: End-to-end perception network,” "arXiv preprint arXiv
[18] Han, Cheng, Qichao Zhao, Shuyi Zhang, Yinzi Chen, Zhenlin Zhang, and Jinwei Yuan, “YOLOPv2: Better, Faster, Stronger for Panoptic Driving Perception,” arXiv [preprint arXiv], 2208.11434 (2022) Sách, tạp chí
Tiêu đề: YOLOPv2: Better, Faster, Stronger for Panoptic Driving Perception,” "arXiv [preprint arXiv]
[19] H. W. Kuhn, “The Hungarian Method for the Assignment Problem,” Naval Research Logistics Quarterly, vol. 2, no. 1--2, pp. 83--97, 1955 Sách, tạp chí
Tiêu đề: The Hungarian Method for the Assignment Problem,” "Naval Research Logistics Quarterly
[20] R., E., Kalman, “A New Approach to Linear Filtering and Prediction Problems,” Journal of Basic Engineering, 82(1):35-45. doi: 10.1115/1.3662552, 1960 Sách, tạp chí
Tiêu đề: A New Approach to Linear Filtering and Prediction Problems,” "Journal of Basic Engineering
[21] N. Wojke, A. Bewley, and D. Paulus, “Simple online and Realtime Tracking with a deep association metric,” arXiv [cs.CV], 2017 Sách, tạp chí
Tiêu đề: Simple online and Realtime Tracking with a deep association metric,” "arXiv [cs.CV]
[22] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple Online and Realtime Tracking,” arXiv [cs.CV], 2016 Sách, tạp chí
Tiêu đề: Simple Online and Realtime Tracking,” "arXiv [cs.CV]

TÀI LIỆU CÙNG NGƯỜI DÙNG

TÀI LIỆU LIÊN QUAN