DOCUMENT INFORMATION

Basic information

Title: Apply Graph Neural Network for Driver Activity Recognition from Multiple Cameras
Author: Nguyen Tien Dat
Supervisor: Dr. Ta Viet Cuong
Institution: Vietnam National University, Hanoi University of Engineering and Technology
Major: Computer Science
Document type: Master's Thesis
Year of publication: 2024
City: Hanoi
Number of pages: 63
File size: 2.89 MB


Contents

  • 1.1 Motivation
  • 1.2 Contributions
  • 1.3 Thesis Outline
  • Chapter 2: Background Knowledge
    • 2.1 Action Recognition
      • 2.1.1 CNN and RNN Combination
      • 2.1.3 Two-Stream Networks
    • 2.2 Distracted Driving Action Recognition
    • 2.3 Skeleton-Based Action Recognition
      • 2.3.1 Foundations and Advantages of Skeleton-Based Recognition
      • 2.3.2 Graph-Based Representation of Human Skeletons
      • 2.3.3 Graph Convolutional Networks (GCNs)
      • 2.3.4 Spatial-Temporal Graph Convolutional Networks (ST-GCN)
  • Chapter 3: Methodology
    • 3.1 Problem Statement
    • 3.2 Overall Architecture
    • 3.3 Image Module
      • 3.3.1 Input Representation and Preprocessing
      • 3.3.2 Feature Extraction Pipeline
      • 3.3.3 Temporal Feature Learning
    • 3.4 Pose Module
      • 3.4.1 Graph-based Pose Representation
      • 3.4.2 Spatial-Temporal Graph Construction
      • 3.4.3 Feature Learning through Graph Convolution
      • 3.4.4 Implementation Details and Challenges
    • 3.5 Output Module
      • 3.5.1 Feature Transformation
      • 3.5.2 Feature Integration Strategies
      • 3.5.3 Classification and Training
  • Chapter 4: Evaluation
    • 4.1 Dataset
      • 4.1.1 The NVIDIA AI City Challenge 2023 Dataset
      • 4.1.2 Data Collection Setup
      • 4.1.3 Dataset Content and Structure
      • 4.1.4 Dataset Organization and Preprocessing
    • 4.2 Experimental Setup
      • 4.2.1 Hyperparameter Configuration
      • 4.2.2 Training Strategy
      • 4.2.3 Evaluation Metrics
    • 4.3 Experiment Results
    • 4.4 Ablation Study on Output Module
  • List of Figures and Tables
    • 2.1 Example classes from the Kinetics dataset [18]
    • 2.2 A chronological overview of recent representative work in video action recognition from 2014 to 2020 [21]
    • 2.3 CNN and LSTM combine architecture [22]
    • 2.6 Two-stream architecture [25]
    • 2.7 A SlowFast network [8]
    • 2.8 Temporal segment network [9]
    • 2.9 The 17 keypoints used to represent the human body in skeleton-based action recognition
    • 2.10 The pipeline of ST-GCN for skeleton-based action recognition
    • 3.1 The overview of our proposed two-stream architecture for combination between the Image stream and the Pose stream
    • 3.2 The overview of the Image Module which contains a preprocessing step, a 2D convolution, and a 3D convolution operator for extracting features from the sequence of images
    • 3.3 The overview of the Pose module which employs a pose extractor to generate graph representations of human poses as input to the ST-GCN
    • 4.2 List of distracted driving activities in the SynDD2 dataset [42]
    • 4.3 Key parameters of our proposed architecture for experiments
    • 4.4 Accuracy comparison of our proposed method (Image + Pose) with various deep learning methods across different camera views
    • 4.5 Accuracy of different architectures for pose classification. The best results are underlined
    • 4.6 The number of hidden features and alpha impact the dashboard view
    • 4.7 The number of hidden features and alpha impact the Rear View
    • 4.8 The number of hidden features and alpha impact the Right-side window
    • 4.9 Average inference complexity comparison with 1 video 2s, batch size 1, in seconds

Content


Motivation

Recognizing driver activities through video analysis is a key focus in automotive research, aiming to improve driver safety and experience [1]. This process involves classifying activities into low-level and high-level categories, utilizing computer vision techniques with single or multiple cameras to monitor driver behavior and vehicle control actions [2]. The application of this technology has significant implications for promoting safe driving practices and reducing accidents [3].

Activity recognition has advanced significantly through deep learning, efficiently managing both simple and complex activity sets [4]. Building on foundational architectures such as VGG [5] and ResNet [6], researchers have created sophisticated models capable of precise classification from input images. However, these models often demand high computational resources, limiting their suitability for real-time driver monitoring systems. To address this, the development of video-based approaches has led to innovative architectures that effectively capture temporal relationships between frames, enhancing activity recognition accuracy:

– The 3D convolution network [7], which extends traditional convolution operations from single images to image sequences

– The SlowFast architecture [8], implementing a dual-stream approach for efficient feature extraction

– The temporal segment network [9], developed to capture long-range relationships between video frames

Driver activity recognition faces unique challenges due to frequent occlusions in vehicle environments, where constrained camera positions often result in partial or complete obstruction of driver actions. These occlusions significantly reduce recognition accuracy, making it difficult for traditional deep learning methods to maintain performance. Addressing these challenges is essential to improve driver behavior monitoring systems in automotive safety applications.

Researchers have addressed challenges in human pose estimation by exploring keypoint-based approaches that utilize pose estimation and graph representations, as demonstrated in studies [11] and [12]. Graph neural networks enhance prediction accuracy by effectively propagating information across different body parts, even when some are occluded. Numerous frameworks for extracting human poses from images and videos [13, 14] provide a solid foundation for these graph-based methods. This alternative representation not only mitigates occlusion issues but also offers computational advantages over traditional 3D convolutional and temporal segment networks, making it well suited for real-time driver monitoring systems.

Contributions

Our research introduces a novel two-stream architecture that effectively combines an image-based video stream with a graph-based pose stream. The key contributions of our work include:

1. Development of an Image stream that integrates image feature extraction with temporal convolution, capturing dynamic relationships between frames while employing strategic sampling to manage computational complexity.

2. Implementation of a Pose stream that converts individual poses into graph representations, combining them into a dynamic graph structure processed through spatio-temporal graph convolution.

3. Creation of a unified architecture that effectively combines both streams to produce robust activity recognition.

We validated our approach using the AI-City Challenge 2023 dataset [10], testing across multiple camera views. Our results demonstrate:

– Consistent performance improvements of 1-3% in accuracy across all three tested views (dashboard, rear, and right side window)

– Substantial error reduction (10-15%) in challenging scenarios involving occlusion, particularly in the rear view and right side window view configurations

Background Knowledge

Action Recognition

Video activity recognition is an important task in image processing and computer vision, attracting significant attention due to its wide range of applications [4, 15].

Human action recognition in video sequences involves identifying, locating, and predicting human actions, playing a vital role in fields like human behavior analysis, video retrieval, human-robot interaction, and entertainment systems. Despite its importance, effective action recognition faces challenges such as managing large volumes of video data, which consist of sequential frames leading to increased input sizes compared to images. Variations in video length, quality, and resolution further complicate the process, while acquiring sufficient training data remains time-consuming and resource-intensive.

To tackle challenges in action recognition, researchers have created diverse benchmark datasets that evaluate algorithm performance across various scenarios and difficulty levels. Notable datasets such as UCF-101 [16] and HMDB51 [17] serve as essential platforms for benchmarking and advancing action recognition techniques.

Figure 2.1: Example classes from the Kinetics dataset [18]

Effectively recognizing actions in videos requires addressing two fundamental questions:

- Modeling Temporal Information: How can we effectively capture and model the temporal information embedded in video sequences? This involves understanding the relationships between frames and identifying patterns across time.

- Reducing Computational Complexity: How can we reduce computational cost without sacrificing classification accuracy? Given the significant resource demands of video processing models, minimizing computational overhead is critical for their practical deployment and real-world applications.

Before the rise of deep learning, action recognition primarily depended on handcrafted features that capture object motion, such as Stacked Fisher Vectors [19] and silhouette-based methods [20]. These techniques effectively identified key movement characteristics but were computationally intensive and difficult to scale for real-world applications. Classification was typically performed using machine learning algorithms like Support Vector Machines (SVM), which, despite producing reasonable results, further added to the complexity and resource requirements of traditional approaches.

Figure 2.2: A chronological overview of recent representative work in video action recognition from 2014 to 2020 [21]

Since 2014, the field of action recognition has experienced significant progress due to the integration of deep learning techniques, especially Convolutional Neural Networks (CNNs). These advancements have revolutionized video analysis by enabling more efficient and accurate recognition of human actions. Deep learning models have transformed traditional methods, making action recognition more reliable and scalable, as shown in Figure 2.2. The primary approaches in the field can be broadly classified into three main categories, reflecting diverse strategies for analyzing video data.

Combining Convolutional Neural Networks (CNNs) with Recurrent Neural Networks (RNNs) is a popular approach for action recognition, leveraging the idea that video sequences can be modeled as a series of consecutive frames. This architecture typically employs CNNs to extract rich spatial features from each frame, which are then fed into RNNs, such as LSTM or GRU, to capture the temporal dynamics of the video. Notably, the Long-term Recurrent Convolutional Networks (LRCN) [22] pioneered this approach by integrating CNNs with LSTMs, demonstrating the effectiveness of combining spatial analysis with long-term temporal memory for improved action recognition.

Figure 2.3: CNN and LSTM combine architecture [22]

3D CNNs have become a powerful tool for video processing tasks by effectively modeling both spatial and temporal information. Unlike traditional 2D CNNs that analyze only spatial dimensions (height and width), 3D CNNs incorporate an additional dimension representing time. This ability makes 3D CNNs particularly suitable for applications such as action recognition, video classification, and motion detection, enhancing their overall performance in understanding dynamic visual content.

The core concept of 3D CNNs involves processing input tensors that capture sequences of video frames over time. Typically, a video clip with L frames of size H×W is represented as a tensor with dimensions (L, H, W, 3), where the final dimension corresponds to RGB color channels. To optimize this data for 3D CNN processing, the tensor is reshaped to (3, L, H, W), with L representing the temporal dimension and H and W representing the spatial dimensions. This format enables effective learning of spatiotemporal features in video analysis.
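To make the layout concrete, here is a minimal PyTorch sketch (the clip size and layer width are illustrative assumptions, not values from the thesis) showing the permutation from frame-major (L, H, W, 3) storage to the channel-first (3, L, H, W) layout expected by a 3D convolution:

```python
import torch
import torch.nn as nn

# A clip of L = 16 RGB frames of size 112x112, stored frame-major: (L, H, W, C)
clip = torch.rand(16, 112, 112, 3)

# 3D convolutions in PyTorch expect (N, C, L, H, W), i.e. channel-first with an
# explicit temporal dimension, so we permute and add a batch dimension.
clip = clip.permute(3, 0, 1, 2).unsqueeze(0)   # -> (1, 3, 16, 112, 112)

# A single 3x3x3 spatio-temporal convolution over the clip (toy layer size).
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
features = conv3d(clip)                        # -> (1, 8, 16, 112, 112)
print(features.shape)
```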

3D CNNs offer the significant advantage of capturing and modeling spatio-temporal information simultaneously, leading to improved performance in video processing tasks compared to traditional 2D CNNs that lack the temporal dimension. By leveraging 3D convolutions, these models can learn both temporal dynamics and spatial features at once, resulting in more accurate and robust predictions. However, despite their effectiveness, 3D CNNs also present several challenges:

1. Computational Complexity: 3D CNNs are inherently more complex than 2D CNNs due to the additional temporal dimension. This complexity results in higher computational costs and longer training times, often requiring weeks of training on large datasets.

2. Data Requirements: Training a 3D CNN requires a substantial amount of labeled video data to achieve good performance. Collecting and annotating such large and diverse datasets is both time-consuming and resource-intensive.

3. Parameter Optimization: The increased complexity of 3D CNNs makes it challenging to optimize their parameters effectively. Finding the right balance of hyperparameters often requires extensive experimentation and fine-tuning.

4. Lack of Pretrained Models: Unlike 2D CNNs, which benefit from pretrained models on large-scale image datasets like ImageNet, 3D CNNs have historically lacked such pretrained models. This limitation restricts their ability to generalize well without extensive training. However, the introduction of the Inflated 3D (I3D) model [24] enables the transfer of pretrained weights from 2D CNNs to 3D CNNs, greatly enhancing their performance and usability across various video processing tasks.

Key research papers in this area include "Learning Spatiotemporal Features with 3D Convolutional Networks" (C3D) [23], which advances motion recognition through 3D convolutions; "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset" (I3D) [24], offering a novel approach leveraging the extensive Kinetics dataset for improved accuracy; and "TAM: Temporal Adaptive Module for Video Recognition" [26], introducing adaptive temporal modules to enhance action recognition performance in videos.

Two-stream networks, introduced in the influential paper "Two-stream convolutional networks for action recognition in videos" [27], offer an innovative approach to video recognition by utilizing two separate input streams. These networks process RGB frames for spatial content and optical flow for capturing motion and temporal dynamics, both using identical architectures. The combined prediction is achieved through late fusion of the outputs from both streams, enhancing the accuracy of action recognition in videos.

Distracted Driving Action Recognition

Distracted driving recognition has become a vital focus in action recognition due to its potential to improve road safety and lower traffic accidents. Recent advancements, especially in deep learning techniques, have significantly propelled this field forward.

Several landmark studies have contributed to this domain’s development:

– The Drive&Act dataset [28] offers a comprehensive collection of 12 distraction-related actions captured from six different camera angles, providing valuable data for research. This study highlights the effectiveness of combining video analysis with body pose estimation to improve action recognition accuracy, demonstrating the potential of such integrated approaches in driver distraction detection.

– The Driver Anomaly Detection (DAD) dataset [29] introduced a novel approach using contrastive learning to distinguish between normal and abnormal driving behaviors, incorporating data from 31 subjects in real-world driving conditions.

– Research by Eraqi et al [30] explored the adaptation of existing video recognition models through large-scale fine-tuning, specifically targeting distracted driving recognition.

Methodology

Problem Statement

Driver activity recognition is a multi-class classification task aimed at identifying specific driver behaviors, such as texting, calling, drinking, or adjusting vehicle controls, from in-vehicle video sequences. The model predicts the driver's activity class (ŷ ∈ Y) based on input videos, facilitating real-time understanding of driver actions. A key challenge in this domain is achieving accurate recognition when the driver is partially occluded from the camera's view, which is common in real-world driving environments. Effective driver activity recognition enhances vehicle safety systems by monitoring driver behavior and identifying potentially hazardous actions.

Overall Architecture

In the challenging field of driver activity recognition, scene occlusion poses a significant obstacle by obscuring parts of the driver and introducing noise into image features, which hampers accurate identification. To overcome this key challenge, we have developed a novel two-stream architecture that leverages multiple perspectives to improve recognition performance under occlusion conditions.

Our architecture features two parallel processing branches designed to analyze various facets of driver behavior. The first, the Image module, processes raw visual data from video streams to capture contextual visual information. The second, the Pose module, focuses on analyzing the driver's posture and movement patterns for behavioral insights. An overview of this dual-branch architecture is illustrated in Figure 3.1, highlighting its comprehensive approach to driver monitoring.

Figure 3.1: The overview of our proposed two-stream architecture for combination between the Image stream and the Pose stream.

To understand how our system processes information, let's first examine the structure of our input data. The video streams entering our system are represented as a tuple of [T, n_channels, R, C], where:

– T represents the number of image frames

– n_channels indicates the number of channels in each frame

– R and C specify the number of rows and columns in the input image

In our implementation, we fix the value of T because each action occurs within a predefined time window, ensuring accurate data capture. To maintain simplicity and clarity in the analysis, overlapping regions between consecutive actions are not considered, streamlining the process for more precise results.

Our architecture features two complementary branches, with the Image branch responsible for processing raw image data and capturing spatio-temporal relationships. To optimize computational efficiency, this branch utilizes a "sparse stream," operating on a carefully selected subset of images, which reduces processing overhead. This approach results in a robust hidden representation, h_I, that is more resilient to occlusion-related noise, enhancing the overall performance of our model.

The Pose stream offers a complementary approach by utilizing a graph-based representation of the driver's posture, making it a "dense stream": because its dynamic graph structure is lightweight compared to image-based features, it can operate on every frame. This method effectively captures essential information about the driver's movements through its own hidden representation, h_P, enabling more efficient and focused posture analysis.

The Output Module integrates the two streams by combining their respective representations into a comprehensive joint representation. Since h_I and h_P typically have different dimensions due to their different feature extractors, the module first maps them into a unified feature space before merging them. The combined representation, denoted as h, is then processed through feed-forward layers to recognize the specific driver activity, leveraging the complementary information from both streams.
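To make the integration step concrete, the sketch below shows two plausible merge strategies consistent with this description and with the ablation in Section 4.4: plain concatenation and a weighted fusion with weight λ. The class name, projection sizes, and the direction in which λ weights the image stream are assumptions for illustration, not the thesis's code.

```python
import torch
import torch.nn as nn

class OutputModule(nn.Module):
    """Merges the image representation h_I and the pose representation h_P."""
    def __init__(self, dim_image=512, dim_pose=32, hidden=64, n_classes=16,
                 mode="fusion", lam=0.5):
        super().__init__()
        self.mode, self.lam = mode, lam
        # Project both streams to a common hidden size so they can be mixed.
        self.proj_image = nn.Linear(dim_image, hidden)
        self.proj_pose = nn.Linear(dim_pose, hidden)
        in_dim = 2 * hidden if mode == "concat" else hidden
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Dropout(0.2), nn.Linear(in_dim, n_classes))

    def forward(self, h_image, h_pose):
        zi, zp = self.proj_image(h_image), self.proj_pose(h_pose)
        if self.mode == "concat":
            h = torch.cat([zi, zp], dim=-1)            # joint representation h
        else:
            h = self.lam * zi + (1.0 - self.lam) * zp  # weighted fusion
        return self.classifier(h)

logits = OutputModule()(torch.rand(4, 512), torch.rand(4, 32))  # -> (4, 16)
```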

The following sections examine the key components of the architecture responsible for robust driver activity recognition, analyze how each element functions, and describe how they work together to accurately identify driver behaviors despite challenges such as occlusion and noise.

Image Module

Figure 3.2: The overview of the Image Module, which contains a preprocessing step, a 2D convolution, and a 3D convolution operator for extracting features from the sequence of images

Our Image Stream component is designed with an efficient, lightweight architecture that effectively captures both spatial and temporal features from image sequences, essential for robust image analysis. It integrates three key modules: a preprocessing stage for image sampling, a 2D feature extraction phase, and a 3D temporal convolution layer, enabling comprehensive analysis of image data. This streamlined pipeline, as illustrated in Figure 3.2, ensures optimal performance in extracting meaningful features for our system.

The input to our Image Module is a sequence of images, denoted as:

S = [I_0, I_1, ..., I_{T-1}]    (3.1)

where each frame I_k has the shape [n_channels, R, C]. This sequence undergoes processing to produce a video stream representation of shape T × D_I, with T representing the frame count and D_I representing the dimensionality of the image features.

To optimize computational efficiency, we implement a strategic sampling approach. By selecting τ images with a sampling frequency of k = T/τ, we create a new, more manageable sequence:

S_τ = [I_{k_0}, I_{k_0+k}, ..., I_{k_0+(τ-1)k}]

Our sampling strategy effectively reduces computational overhead while capturing the most relevant frames for accurate predictions. After sampling, we perform additional preprocessing steps, such as converting frames to grayscale, to streamline the input data and enhance processing efficiency.
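A minimal sketch of this uniform sampling step, assuming the clip is already a (T, C, H, W) tensor and the first sampled index k_0 is 0:

```python
import torch

def sample_frames(clip: torch.Tensor, tau: int) -> torch.Tensor:
    """Keep tau frames from a (T, C, H, W) clip with stride k = T // tau."""
    T = clip.shape[0]
    k = T // tau                          # sampling frequency k = T / tau
    idx = torch.arange(0, k * tau, k)     # frame indices k0, k0 + k, ...
    return clip[idx]                      # -> (tau, C, H, W)

clip = torch.rand(60, 3, 512, 512)        # a 2-second clip at 30 fps
sampled = sample_frames(clip, tau=4)      # 4 frames, matching tau = 4 in Table 4.3
print(sampled.shape)                      # torch.Size([4, 3, 512, 512])
```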

Our feature extraction process begins with a pretrained backbone network f_θ. This network processes each sampled image I_t at timesteps t = k_0, k_0 + k, ..., generating hidden features through the transformation:

H_t = f_θ(I_t)

The resulting output H_t takes the form of a 2D feature map with dimensions R_I × C_I. We then organize these feature maps into the sequential structure [H_{k_0}, H_{k_0+k}, ...].

The final stage of our Image Module captures temporal relationships within this feature sequence. The 2D feature maps are processed through a 3D convolution layer, which models temporal dependencies across the sampled frames, and a Generalized Mean (GeM) pooling operation then aggregates the features into a single representation:

h_I = GeM(3DCNN_ϕ([H_{k_0}, H_{k_0+k}, ...]))

Our model thus combines the 3D convolution layer 3DCNN_ϕ with the GeM pooling layer to capture both spatial and temporal features within the frame sequence. The output, denoted h_I, is a compact feature vector that encodes the temporal and spatial characteristics of the input clip.
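The following sketch ties the three stages together (per-frame 2D backbone, temporal 3D convolution, GeM pooling). The backbone is assumed to return spatial feature maps with feat_channels channels, and the layer sizes are illustrative rather than the thesis's exact configuration:

```python
import torch
import torch.nn as nn

def gem_pool(x: torch.Tensor, p: float = 3.0, eps: float = 1e-6) -> torch.Tensor:
    """Generalized Mean (GeM) pooling over the temporal and spatial dimensions."""
    return x.clamp(min=eps).pow(p).mean(dim=(2, 3, 4)).pow(1.0 / p)

class ImageModule(nn.Module):
    def __init__(self, backbone: nn.Module, feat_channels: int, out_dim: int = 512):
        super().__init__()
        self.backbone = backbone              # pretrained 2D feature extractor f_theta
        self.conv3d = nn.Conv3d(feat_channels, out_dim, kernel_size=3, padding=1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (tau, C, H, W), the sampled frames of one clip.
        feats = [self.backbone(f.unsqueeze(0)) for f in frames]  # each (1, C', H', W')
        H = torch.stack(feats, dim=2)          # (1, C', tau, H', W') feature sequence
        H = self.conv3d(H)                     # temporal 3D convolution
        return gem_pool(H)                     # h_I: (1, out_dim)
```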

This approach effectively reduces computational complexity by utilizing strategic sampling and a pretrained backbone network; however, persistent occlusion in input videos posed ongoing challenges, requiring additional processing methods. To address this limitation, we developed a second processing branch, enhancing the system's ability to handle occlusions more robustly.

Pose Module

Figure 3.3 illustrates the Pose module, which utilizes a pose extractor to generate graph representations of human poses. These graphs serve as the input for the ST-GCN, enabling the model to learn the complex spatial-temporal structure within the data. This approach enhances the accuracy and robustness of pose-based action recognition by effectively modeling both spatial and temporal dynamics.

To complement the Image Module, we introduce a Pose Module that captures the structural dynamics of driver movements using graph-based representations. This method improves robustness against occlusions by emphasizing the spatial relationships between key body joints. As shown in Figure 3.3, our Pose Module combines pose extraction with spatial-temporal graph convolution networks (ST-GCN) to effectively learn and analyze meaningful motion patterns over time.

The first step in our pose analysis pipeline is to convert each input image sequence S into a series of graph structures. For each frame I_t, a pose extractor is employed to generate a corresponding graph G_t.

Each graph G_t contains three main components:

– V_t: the set of vertices (representing body joints) at time step t,

– E_t: the set of edges connecting the joints (bones),

– X_t: the feature matrix containing the 2D coordinates of each joint.

These graphs are then organized into a dynamic sequence that captures the temporal evolution of the driver's pose:

S_G = [G_0, G_1, ..., G_{T-1}]

This sequence S_G represents the time-series graph data that will be processed in subsequent steps.
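For illustration, one way such a per-frame graph could be assembled from 17 keypoints is sketched below; the edge list is the commonly used COCO 17-keypoint skeleton, assumed here for illustration rather than taken from the thesis:

```python
import torch

# Common COCO 17-keypoint skeleton: pairs of connected joints (bones).
COCO_EDGES = [(15, 13), (13, 11), (16, 14), (14, 12), (11, 12), (5, 11), (6, 12),
              (5, 6), (5, 7), (6, 8), (7, 9), (8, 10), (1, 2), (0, 1), (0, 2),
              (1, 3), (2, 4), (3, 5), (4, 6)]

def frame_graph(keypoints_xy: torch.Tensor):
    """Build (A_t, X_t) for one frame; keypoints_xy has shape (17, 2)."""
    n = keypoints_xy.shape[0]
    A = torch.zeros(n, n)
    for i, j in COCO_EDGES:                    # E_t: undirected bone edges
        A[i, j] = A[j, i] = 1.0
    X = keypoints_xy                            # X_t: 2D joint coordinates as node features
    return A, X

A_t, X_t = frame_graph(torch.rand(17, 2))       # one graph G_t = (V_t, E_t, X_t)
```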

To accurately capture both spatial and temporal relationships, we utilize a union graph approach, which integrates spatial connections within individual frames and temporal links across successive frames. This method enhances the modeling of dynamic interactions, providing a comprehensive understanding of spatial-temporal dependencies.

At each time step t, the graph G_t is augmented with temporal edges linking the joints in frame t to the corresponding joints in frame t+1, creating a unified spatio-temporal graph. The adjacency matrix A_S captures both the spatial connections within individual frames and the temporal connections across consecutive frames, while the matrix X_S collects the joint coordinates across all time steps. Together they provide a comprehensive representation of joint interactions and their temporal dynamics.
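One plausible way to assemble A_S and X_S, assuming temporal edges simply link each joint to the same joint in the next frame (the thesis may encode temporal connectivity differently), is:

```python
import torch

def union_graph(A: torch.Tensor, X_seq: torch.Tensor):
    """A: (V, V) spatial adjacency; X_seq: (T, V, 2) joint coordinates.
    Returns the spatio-temporal adjacency A_S of size (T*V, T*V) and the
    flattened coordinate matrix X_S of size (T*V, 2)."""
    T, V, _ = X_seq.shape
    A_S = torch.zeros(T * V, T * V)
    for t in range(T):
        s = t * V
        A_S[s:s + V, s:s + V] = A               # spatial edges inside frame t
        if t + 1 < T:                           # temporal edges: joint v in frame t
            idx = torch.arange(V)               # connects to joint v in frame t + 1
            A_S[s + idx, s + V + idx] = 1.0
            A_S[s + V + idx, s + idx] = 1.0
    X_S = X_seq.reshape(T * V, 2)
    return A_S, X_S
```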

3.4.3 Feature Learning through Graph Convolution

Our Pose Module is centered around the ST-GCN, a network that captures spatial and temporal features by processing the graph sequence. At each layer l, the node features B_l are updated by combining spatial graph convolution within frames with temporal convolution across frames, where:

– W_s^l and W_t^l are the learnable weight matrices for spatial and temporal convolutions at layer l,

– Â_s = I + A_s represents the augmented adjacency matrix with self-loops to ensure each joint's features are preserved,

– D_s is the diagonal node degree matrix,

– σ is the ReLU activation function.
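Combining the symbols above, a standard ST-GCN-style layer update consistent with this description (shown here as a plausible reconstruction, not the thesis's exact formula) is:

```latex
B_{l+1} \;=\; \mathrm{TCN}_{W_t^{l}}\!\left(\sigma\!\left(D_s^{-\frac{1}{2}}\,\hat{A}_s\,D_s^{-\frac{1}{2}}\,B_l\,W_s^{l}\right)\right),
\qquad \hat{A}_s = I + A_s,
```

where TCN_{W_t^l} denotes a 1-D convolution along the temporal dimension parameterized by W_t^l, and B_l are the node features at layer l.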

The initial node features B_0 are obtained by projecting the joint coordinates into an embedding space:

B_0 = X_S W_0

where W_0 is a learnable projection matrix.

To generate the final pose representation, we apply average pooling to the output of the last layer, followed by a fully connected layer:

h_P = FC_g(AvgPooling(B_{L-1}))

The complete pose stream can therefore be written as a trainable function h_P = f_ψ(S_G), where ψ collects all the trainable parameters, including the weight matrices.

Here ψ = {W_0, W_s^0, W_t^0, W_s^1, W_t^1, ..., FC_g} denotes the set of trainable parameters. The resulting feature vector h_P captures the temporal and spatial dynamics of the pose sequence, providing a compact representation of the driver's posture and movement patterns that the model uses to interpret and classify driver behaviors.

We utilize a pretrained pose extractor to generate the initial graphs G_t, simplifying the training process. However, extracting accurate poses of the seated driver is challenging due to occlusion and camera distortion, which can introduce noise into the pose graphs. This noise may impact the quality of the learned representations, emphasizing the need for robust pose estimation methods in driver monitoring systems.

To overcome challenges in automotive pose estimation, we are focusing on developing advanced, environment-specific pose extraction techniques These improvements will significantly boost the accuracy and adaptability of the Pose Module, ensuring reliable performance in real-world automotive scenarios.

Evaluation

Dataset

4.1.1 The NVIDIA AI City Challenge 2023 Dataset

The 7th NVIDIA AI City Challenge 2023 introduced the Synthetic Distracted Driving (SynDD2) dataset, providing a robust foundation for driver behavior analysis. This comprehensive dataset includes 210 high-resolution video clips, each approximately 9 minutes long, captured at 1920×1080 pixels resolution and 30 fps. Designed for developing and evaluating models related to distracted driving detection, the SynDD2 dataset is highly suitable for advancing research in this field.

The data collection process employed a synchronized three-camera system strategically positioned inside the vehicle to capture multiple perspectives of driver behavior. The camera setup configuration is detailed in Table 4.1 and illustrated in Figure 4.1, ensuring comprehensive coverage of the driver's actions and interactions. This setup enhances the dataset's effectiveness for behavior recognition tasks, providing valuable insights into driver behavior from different angles.

Table 4.1: The three in-vehicle camera views for driver behavior recognition [41]

Camera       Mounting position
Dash Cam 1   Dashboard
Dash Cam 2   Behind rear view mirror
Dash Cam 3   Top right side window

Figure 4.1: Camera mounting setup for the three views listed in Table 4.1 [41]

The selected camera positions are strategically chosen to capture diverse driver behaviors from multiple angles, enhancing safety monitoring. Dash Cam 1 offers a clear frontal view of the driver, while Dash Cam 2 provides an over-the-shoulder perspective from behind the rearview mirror. Additionally, Dash Cam 3 records a side view through the top right window. This multi-angle setup ensures comprehensive coverage, effectively recording various distracted behaviors such as texting or reaching for objects, thereby improving the reliability of driver behavior analysis.

The SynDD2 dataset features recordings of 35 drivers performing 16 different distracted driving behaviors while stationary, capturing diverse real-world scenarios. To enhance realism, some drivers wore accessories such as sunglasses and hats, adding variability to the data. The driving activities were recorded in random sequences and durations, closely simulating natural driving conditions, as outlined in Table 4.2.

Table 4.2: List of distracted driving activities in the SynDD2 dataset [42]

Sr no.   Distracted Driver Behavior
9        Picking up from floor (driver)
10       Picking up from floor (passenger)
11       Talking to passenger at the right
12       Talking to passenger at backseat
15       Singing and dancing with music

Each activity is precisely defined: for example, "Texting (left)" refers to the driver using the left hand to send messages, while "Adjusting control panel" involves interaction with the vehicle's controls. This detailed categorization enables accurate detection and analysis of specific distractions and of how they impair driving performance.

The SynDD2 dataset is divided into three parts: A1, A2, and B. The A1 subset features fully labeled videos with precise annotations of distraction types and their temporal boundaries, enabling detailed analysis. This thesis concentrates solely on the A1 subset, which shares the same data format as parts A2 and B, ensuring consistency across all dataset segments.

Dataset A1 is structured using a text file format where each line corresponds to a single action instance. The format is as follows:

– video_id: The unique identifier for each video, starting from 1.

– activity_id: The numerical label corresponding to the distracted driving behav- ior as defined in Table 4.2.

– start_time: The timestamp indicating when the action begins, measured in seconds (integer value). For example, a start time of 130 represents the 130th second (2 minutes and 10 seconds) of the video.

The "end_time" parameter specifies the exact moment an action concludes, measured in seconds as an integer value For instance, an end_time of 263 indicates that the action ends at the 263rd second of the video, which is equivalent to 4 minutes and 23 seconds Accurate setting of end_time ensures precise control over video segmentation and timing, essential for effective content editing and synchronization Incorporating this parameter helps optimize your video workflows by clearly defining where each action or segment concludes.

This structured format facilitates easy parsing and processing of the data for subsequent analysis and model training.
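A small parsing sketch under the assumptions that the fields are whitespace-separated and that each annotated interval is cut into consecutive non-overlapping two-second clips (the file name and the Clip type are hypothetical):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Clip:
    video_id: int
    activity_id: int
    start: int   # clip start, in seconds
    end: int     # clip end, in seconds

def load_clips(path: str, clip_len: int = 2) -> List[Clip]:
    """Read 'video_id activity_id start_time end_time' lines and cut each
    annotated interval into consecutive, non-overlapping clip_len-second clips."""
    clips = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            video_id, activity_id, start, end = map(int, line.split())
            for s in range(start, end - clip_len + 1, clip_len):
                clips.append(Clip(video_id, activity_id, s, s + clip_len))
    return clips

# clips = load_clips("annotations.txt")   # hypothetical file name
```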

To enhance the robustness and generalizability of the proposed model, a driver-independent data split strategy was implemented, which involves dividing the dataset based on different drivers. This approach effectively prevents data leakage and ensures the model is evaluated on entirely unseen driver data, thereby providing a more accurate assessment of its real-world performance. The data split is as follows:

– Training set: Comprises videos from 20 drivers.

– Testing set: Comprises videos from 5 drivers.

– Total training duration: Approximately 200 minutes of video data.

This partitioning strategy guarantees that the model’s evaluation reflects its ability to generalize to new drivers, thereby providing a more accurate measure of its performance in real-world scenarios.

For effective activity recognition, videos were segmented into two-second clips, each containing 60 frames, starting from the first frame to ensure consistent temporal alignment across all samples. To enhance model performance, the training dataset was augmented with both annotated activity sequences and unannotated background segments, following established methodologies [43], which improves the model's ability to differentiate between distracted and non-distracted behaviors. Additionally, the following data augmentation techniques were employed during preprocessing to further increase the model's robustness and generalization capabilities:

– Horizontal Flipping: Randomly flipping the video frames horizontally to simulate different driver orientations.

– Normalization: Scaling all pixel values to the range [0, 1] to provide consistent input data, which helps stabilize training and accelerate model convergence.

These augmentation strategies help in increasing the diversity of the training data, thereby enhancing the model's ability to generalize to various real-world conditions.
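A minimal sketch of these two preprocessing steps applied to a clip tensor; the flip probability of 0.5 is an assumption:

```python
import torch

def augment_clip(clip: torch.Tensor, p_flip: float = 0.5) -> torch.Tensor:
    """clip: (T, C, H, W) with pixel values in [0, 255]."""
    clip = clip.float() / 255.0               # normalization to [0, 1]
    if torch.rand(1).item() < p_flip:         # random horizontal flip of every frame
        clip = torch.flip(clip, dims=[3])     # flip along the width axis
    return clip

augmented = augment_clip(torch.randint(0, 256, (60, 3, 512, 512)))
print(augmented.shape)                         # torch.Size([60, 3, 512, 512])
```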

Experimental Setup

The experimental setup includes detailed hardware and software configurations, evaluation metrics, and baseline models for comparison. Experiments were carried out on a workstation with an NVIDIA GPU, ensuring efficient training and testing of deep learning models.

– CPU: Intel Xeon Gold 6148 (80) @ 3.700GHz

– Additional Libraries: NumPy, OpenCV, Scikit-learn, mmaction2

We utilized various EfficientNet models, ranging from small to large, pre-trained on ImageNet-21K for the image encoding module. For pose extraction, we employed a 2D top-down pose estimator using HRNet [13], generating full-body joint coordinates with 17 keypoints per frame. The input images were resized to 512×512 pixels with three channels, and each clip comprised 60 frames, yielding 60 pose graphs of 17 joints each (1,020 joints in total per clip). Key architectural parameters were optimized for our experimental setup.

Table 4.3: Key parameters of our proposed architecture for experiments

Key parameters are summarized in Table 4.3. The input image is pre-processed to 512×512 pixels with three channels and sampled using τ = 4. For feature extraction, the image module outputs a 512-dimensional feature vector, while the 2D pose feature extractor produces a 256-dimensional vector. The image feature mapping branch generates 512 features, and the pose feature mapping branch outputs 32 features. The merge module employs concatenation, as described in equation (3.14), resulting in a combined 544-dimensional feature vector (512 + 32) before the fully connected layer. The final output predicts one of 16 distinct actions.
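For reference, the parameters stated above can be gathered into a single configuration sketch (the dictionary layout is ours, and the 544-dimensional merged size is inferred from 512 + 32 under concatenation rather than read directly from Table 4.3):

```python
# Key architecture parameters as described in Section 4.2 (values from the text).
CONFIG = {
    "input_size": (3, 512, 512),     # pre-processed frame: 3 channels, 512x512 pixels
    "frames_per_clip": 60,           # 2-second clip at 30 fps
    "tau": 4,                        # sampled frames per clip
    "image_feature_dim": 512,        # output of the image feature extractor
    "pose_feature_dim": 256,         # output of the 2D pose feature extractor
    "image_mapping_dim": 512,        # image feature mapping branch
    "pose_mapping_dim": 32,          # pose feature mapping branch
    "merge": "concatenation",        # merged size: 512 + 32 = 544 before the FC layer
    "num_classes": 16,               # distracted driving activities
}
```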

Key hyperparameters were tuned to optimize the model’s performance:

– Learning Rate: Set to 1e-4 with a decay schedule to facilitate convergence.

– Batch Size: Set to 12 to balance training speed and memory usage.

– Number of Epochs: Trained for 30 epochs.

– Optimizer: Adam [44] optimizer was chosen for its adaptive learning rate capabilities.

– Loss Function: Cross-entropy loss was utilized for multi-class classification.

To enhance the model’s generalization capabilities, the following strategies were employed:

– Dropout: Applied a dropout rate of 0.2 to prevent overfitting.

– Mixup [45]: Combines two random training samples and their labels using a mixing coefficient drawn from a Beta distribution, creating virtual training examples that reduce overfitting and improve generalization (a minimal sketch is given after this list).

– Data Augmentation: As previously mentioned, augmentation techniques were used to increase data diversity.
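A compact sketch of a training step combining these choices (Adam at 1e-4, cross-entropy loss, and mixup); the Beta parameter of 0.2 is an assumption:

```python
import torch
import torch.nn as nn

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """Mix each sample with a randomly permuted partner; returns the mixed inputs,
    both label sets, and the mixing coefficient lam ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1.0 - lam) * x[perm], y, y[perm], lam

def train_step(model, optimizer, criterion, x, y):
    x_mix, y_a, y_b, lam = mixup(x, y)
    logits = model(x_mix)
    # Cross-entropy on both label sets, weighted by the mixing coefficient.
    loss = lam * criterion(logits, y_a) + (1.0 - lam) * criterion(logits, y_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# model = ...  # the two-stream network
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# criterion = nn.CrossEntropyLoss()
```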

In this thesis, we evaluate the classification models using the accuracy metric, which measures the proportion of correct predictions made by the model. Accuracy is a widely used performance indicator that reflects the overall effectiveness of the classifier. Formally, accuracy is defined as the ratio of correctly predicted instances to the total number of predictions, providing a clear and straightforward measure of model performance:

Accuracy = Number of correct predictions / Total number of predictions

Experiment Results

We evaluated our proposed dual-stream architecture for driver action recognition through comprehensive experiments comparing various configurations and architectures. The primary motivation behind our approach is to address the challenges posed by occlusions and diverse viewpoints, which can hinder the accuracy of traditional image-based methods. By integrating both image data and pose information, our architecture leverages complementary information streams to enhance robustness and improve overall recognition performance.

Our experimental evaluation focused on three key aspects:

(1) the effectiveness of different image backbone architectures,

(2) the impact of incorporating pose information through our STGCN [33] module, and

(3) a comparative analysis with state-of-the-art methods.

We evaluated various image feature extraction backbones, including ResNet-50 and the EfficientNetV2 family (S, M, and L variants), demonstrating that increasing model complexity enhances accuracy in capturing spatial and appearance features from video frames. The baseline results, shown in Table 4.4 under the "Image" column, highlight that deeper models like EfficientNetV2-L consistently improve performance, indicating that more sophisticated architectures better recognize nuanced visual patterns related to driver actions.

Integrating pose information through our STGCN module significantly enhances performance in challenging scenarios by effectively capturing skeletal dynamics. The pose stream processes skeletal data as a dynamic graph, which offers several key advantages.

1 https://developers.google.com/machine-learning/crash-course/classification/accuracy

- Robustness to occlusions: Pose estimation can still accurately interpret visible body parts even when certain areas are obscured.

- View invariance: Leveraging the skeletal structure yields a representation that remains consistent and reliable across various camera angles.

- Temporal dynamics: The graph-based processing captures the evolution of pose configurations over time, which is crucial for understanding complex actions

Our results demonstrate a strong synergy between image and pose streams, with the combination of EfficientNetV2-L and STGCN achieving the highest accuracy across all camera views: Dashboard (0.8673), Rear View (0.8269), and Right Side Window (0.8313). This integrated approach proved especially effective in challenging viewpoints like the rear view and right side window, where occlusions are common, highlighting the effectiveness of our architectural choice to combine complementary information streams.

Our approach was rigorously validated by comparing it against established action recognition methods such as TSN, C3D, SlowFast, and TAM. While these traditional methods are effective for general action recognition, our dual-stream architecture significantly outperformed them. Specifically, in the dashboard view, TSN achieved an accuracy of 75.16%, SlowFast reached 84.49%, and our combined EfficientNetV2-L with STGCN achieved a higher accuracy of 86.73%.

Table 4.4: Accuracy comparison of our proposed method (Image + Pose) with various deep learning methods across different camera views. The best results for each view are underlined.

Dashboard    Rear view    Right-side window

We also conducted an ablation study on different graph neural network architectures for the pose stream, comparing STGCN [33] with its variant STGCN++ [37].

Our study fixed the image module backbone to EfficientNetV2-Large while evaluating different graph convolutional network architectures for driver action recognition. Interestingly, the original STGCN outperformed the more complex STGCN++ [37], suggesting that simpler, well-designed architectures can better capture relevant patterns in driver behavior. Despite its architectural enhancements, STGCN++ showed slightly lower performance (Dashboard: 0.8546, Rear view: 0.8153, Right-side window: 0.8167) than the original STGCN, indicating that increased complexity does not always lead to improved results in specialized tasks such as driver action detection. This highlights the importance of selecting models tailored to the specific application.

Our proposed dual-stream architecture effectively tackles the challenges in driver action recognition by combining robust image features from EfficientNetV2-L with structured pose information through STGCN. This balanced approach enhances the model's ability to handle occlusions, viewpoint variations, and complex temporal dynamics, ensuring accurate and reliable driver action recognition.

Table 4.5: Accuracy of different architectures for pose classification. The best results are underlined.

Ablation Study on Output Module

The impact of the fusion weight λ (the alpha ratio between image features and pose features) was evaluated across the three viewing angles. Notably, occlusion is more common in the rear and right-side viewpoints, which can affect classification accuracy. To optimize performance, λ must therefore be carefully adjusted, particularly in views prone to occlusion.

Table 4.6: The number of hidden features and alpha impact the dashboard view

Strategy                o = 32   o = 64   o = 128
Concatenation           0.8561   0.8575   0.8537
Fusion with λ = 0.25    0.8415   0.8386   0.8566
Fusion with λ = 0.50    0.8576   0.8678   0.8517
Fusion with λ = 0.75    0.8474   0.8503   0.8644

In the dashboard view (Table 4.6), the fusion strategy achieves the highest accuracy, with optimal parameters at λ = 0.50 and a hidden dimension of o = 64, indicating that a balanced contribution of image and pose features is essential for extracting relevant information. When the hidden dimension is increased to o = 128, the optimal fusion weight shifts to 0.75, demonstrating the impact of dimension size on the fusion process. Additionally, the concatenation strategy shows a slight decrease in performance compared to the fusion strategy in this camera setting, emphasizing the effectiveness of the fusion approach.

Table 4.7: The number of hidden features and alpha impact the Rear View

Strategy                o = 32   o = 64   o = 128
Concatenation           0.8313   0.8143   0.8235
Fusion with λ = 0.25    0.8192   0.8216   0.8119
Fusion with λ = 0.50    0.8235   0.8323   0.8099
Fusion with λ = 0.75    0.8094   0.8226   0.8192

The combination strategy for the rear view demonstrates optimal performance at λ = 0.50 with a combined latent size of 64, aligning with trends observed in the dashboard view. This indicates that a balanced integration of image and pose features is crucial for achieving the best results. However, increasing the hidden dimension to 128 causes a decline in performance, emphasizing the importance of carefully tuning the hidden dimension.

Table 4.8: The number of hidden features and alpha impact the Right-side window

Strategy                o = 32   o = 64   o = 128
Concatenation           0.8074   0.8211   0.8211
Fusion with λ = 0.25    0.7929   0.7890   0.7691
Fusion with λ = 0.50    0.7832   0.7958   0.8036
Fusion with λ = 0.75    0.8226   0.8089   0.8031

In the right-side window view (Table 4.8), the highest accuracy is achieved by the fusion strategy at the smallest hidden dimension (o = 32). As the hidden dimension increases to 128, fusion performance declines and the concatenation strategy becomes the stronger choice. These findings suggest that, for this view, carefully chosen fusion parameters and smaller hidden dimensions yield the best accuracy.

Overall, these results highlight the importance of the combination strategy in balancing features from the image representation and the graph representation across different input views.

In summary, careful tuning of the fusion weight and the number of hidden features is required for each viewing angle. Combining EfficientNetV2-L for the image module with STGCN for the pose module delivers the best results, especially for the dashboard and rear views, underscoring the importance of parameter refinement in this hybrid approach.

Table 4.9: Average inference time for a single 2-second video with batch size 1, in seconds, together with the relative increase

Model EfficientNetV2-Small EfficientNetV2-Large

Our inference time comparison shows that the Image with Pose branch requires approximately twice the processing time of the Image-only branch, due to the additional pose extraction and GCN module. These results offer a relative comparison, as processing was conducted sequentially without parallelization. Implementing parallel processing for the pose extraction module can significantly reduce the overall inference time, optimizing performance for real-time applications.

This thesis introduces a two-stream approach for driver activity recognition by combining image-based and pose-based features. The architecture includes an image branch that uses a 2D-3D convolution module to extract meaningful representations from sampled frames, and a pose branch that uses a pose extractor to generate and analyze dynamic graphs of driver poses with a spatio-temporal graph convolution network. Integrating these two streams through a merging module significantly enhances recognition accuracy, especially in challenging scenarios with occlusions.

Our experimental results on the AI City Challenge benchmark demonstrate the effectiveness of our approach across various camera views, showcasing its robustness and broad applicability. The model excels in side-view scenarios where traditional single-stream methods typically face challenges due to occlusions, reducing errors by 10-15%. Additionally, our approach achieves an overall accuracy improvement of 1-3% across different viewing angles, highlighting its strong generalization capabilities.

Based on our findings and valuable feedback from the thesis defense committee, several promising directions for future research have emerged:

Enhancing the pose processing module with transformer architectures can significantly improve the model’s capability to capture long-range dependencies and complex spatial relationships, leading to a more accurate understanding of intricate sequential pose patterns.

Implementing cross-stream attention mechanisms between the RGB and pose streams would enhance the model's ability to identify and focus on the most relevant features from each modality. This technique allows the model to weigh the importance of visual appearance against skeletal pose information, potentially improving recognition accuracy for diverse action types and making action prediction more robust across different scenarios.

Our current two-stage pose extraction pipeline employs HRNet for 2D top-down pose estimation to generate 17-keypoint full-body joint coordinates; however, this approach is computationally expensive, which hinders real-time applications. To enable faster and more efficient real-time pose detection, future work could include:

– Development of a lightweight, single-stage pose estimation network that main- tains accuracy while reducing computational overhead

– Implementation of model pruning and quantization techniques specifically for the pose extraction component

– Exploration of parallel processing strategies to optimize the pipeline between pose extraction and graph generation

– Investigation of more efficient backbone architectures for the pose estimator that could enable real-time performance while maintaining detection accuracy

Our research advances the field of driver behavior analysis by demonstrating the effectiveness of integrating multiple information streams for accurate action recognition. This multi-stream approach is especially valuable in real-world conditions, where traditional single-stream methods often struggle due to occlusions and complex viewing angles. Although our current findings are promising, the future research avenues outlined above indicate substantial potential to enhance both the accuracy and the practical application of driver behavior analysis technologies.

[1] Y. Xing et al. “Driver activity recognition for intelligent vehicles: A deep learning approach.” In: IEEE Transactions on Vehicular Technology 68.6 (2019), pp. 5379–5390.

[2] Jiyang Wang et al. “A survey on driver behavior analysis from in-vehicle cameras.” In: IEEE Transactions on Intelligent Transportation Systems 23.8 (2021), pp. 10186–10209.

[3] Chen Huang et al. “HCF: A hybrid CNN framework for behavior detection of distracted drivers.” In: IEEE Access 8 (2020), pp. 109335–109349.

[4] Z. Shuchang. “A Survey on Human Action Recognition.” In: ArXiv (2022). eprint: 2301.06082.

[5] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” In: arXiv preprint arXiv:1409.1556 (2014).

[6] Kaiming He et al. “Deep residual learning for image recognition.” In: CoRR abs/1512.03385 (2015).

[7] Du Tran et al. “Learning spatiotemporal features with 3D convolutional networks.” In: Proceedings of the IEEE International Conference on Computer Vision.

[8] Christoph Feichtenhofer et al. “SlowFast Networks for Video Recognition.” In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Oct. 2019.

[9] Limin Wang et al. “Temporal segment networks for action recognition in videos.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 41.11 (2018), pp. 2740–2755.

[10] Milind Naphade et al. “The 7th AI City Challenge.” In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 2023. arXiv: 2304.07500.

[11] Giancarlo Paoletti et al. “Unsupervised human action recognition with skeletal graph laplacian and self-supervised viewpoints invariance.” In: arXiv preprint arXiv:2204.10312 (2022).

[12] “GLTA-GCN: Global-Local Temporal Attention Graph Convolutional Network for Unsupervised Skeleton-Based Action Recognition.” In: 2022 IEEE International Conference on Multimedia and Expo (ICME). 2022.

[13] Z. Cao et al. “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields.” In: CVPR. 2017.

[14] Ke Sun et al. “Deep high-resolution representation learning for human pose estimation.” In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019, pp. 5686–5696.

[15] D. R. Beddiar et al. “Vision-based human activity recognition: a survey.” In: Multimedia Tools and Applications 79.41 (2020), pp. 30509–30555.

[16] K. Soomro, A. R. Zamir, and M. Shah. “UCF101: A dataset of 101 human actions classes from videos in the wild.” In: CoRR abs/1212.0402 (2012).

[17] H. Kuehne et al. “HMDB: A large video database for human motion recognition.” In: ICCV. 2011, pp. 2556–2563.

[18] Will Kay et al. The Kinetics Human Action Video Dataset. 2017. arXiv: 1705.06950 [cs.CV]. url: https://arxiv.org/abs/1705.06950.

[19] Xiaojiang Peng et al. “Action Recognition with Stacked Fisher Vectors.” In: Computer Vision – ECCV 2014. Ed. by David Fleet et al. Cham: Springer International Publishing, 2014, pp. 581–595. isbn: 978-3-319-10602-1.

[20] L. Gonzalez, S. A. Velastin, and G. Acuna. “Silhouette-based human action recognition with a multi-class support vector machine.” In: IET Conference Proceedings. The Institution of Engineering & Technology, 2018.

[21] Yi Zhu et al. A Comprehensive Study of Deep Video Action Recognition. 2020. arXiv: 2012.06567 [cs.CV]. url: https://arxiv.org/abs/2012.06567.

[22] Jeff Donahue et al. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. 2016. arXiv: 1411.4389 [cs.CV]. url: https://arxiv.org/abs/1411.4389.

[23] Du Tran et al. Learning Spatiotemporal Features with 3D Convolutional Networks. 2015. arXiv: 1412.0767 [cs.CV]. url: https://arxiv.org/abs/1412.0767.

[24] Joao Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 2018. arXiv: 1705.07750 [cs.CV]. url: https://arxiv.org/abs/1705.07750.

[25] Zixuan Tang et al. A Survey on Backbones for Deep Video Action Recognition. 2024. arXiv: 2405.05584 [cs.CV]. url: https://arxiv.org/abs/2405.05584.

[26] Zhaoyang Liu et al. TAM: Temporal Adaptive Module for Video Recognition. 2021. arXiv: 2005.06803 [cs.CV]. url: https://arxiv.org/abs/2005.06803.

[27] Karen Simonyan and Andrew Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. 2014. arXiv: 1406.2199 [cs.CV]. url: https://arxiv.org/abs/1406.2199.

[28] Manuel Martin et al. “Drive&Act: A Multi-Modal Dataset for Fine-Grained Driver Behavior Recognition in Autonomous Vehicles.” In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019, pp. 2801–2810. doi: 10.1109/ICCV.2019.00289.

[29] Okan Kopuklu et al. “Driver anomaly detection: A dataset and contrastive learning approach.” In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021, pp. 91–100.

[30] Hesham M. Eraqi et al. “Driver distraction identification with an ensemble of convolutional neural networks.” In: Journal of Advanced Transportation 2019 (2019).

[31] Amir Shahroudy et al. “NTU RGB+D: A large scale dataset for 3D human activity analysis.” In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1010–1019.

[32] Jun Liu et al. “NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 42.10 (2020), pp. 2684–2701.

[33] S. Yan, Y. Xiong, and D. Lin. “Spatial temporal graph convolutional networks for skeleton-based action recognition.” In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. 1. 2018.

[34] Bibin Sebastian. Human Action Recognition using Detectron2 and LSTM. Accessed: 14 Jul 2024. 2021. url: https://learnopencv.com/human-action-recognition-using-detectron2-and-lstm/.

[35] R. Li et al. “Adaptive graph convolutional neural networks.” In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. 1. 2018.

[36] L. Shi et al. “Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks.” In: IEEE Transactions on Image Processing.

[37] Haodong Duan et al. PYSKL: Towards Good Practices for Skeleton Action Recognition. 2022. arXiv: 2205.09443 [cs.CV]. url: https://arxiv.org/abs/2205.09443.

[38] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning CNN Image Retrieval with No Human Annotation. 2018. arXiv: 1711.02512 [cs.CV]. url: https://arxiv.org/abs/1711.02512.

[39] Milind Naphade et al. “The 7th AI City Challenge.” In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. June 2023.
