VIETNAM NATIONAL UNIVERSITY, HA NOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Nguyen Tien Dat
Apply Graph Neural Network for Driver Activity Recognition
from Multiple Cameras
MASTER’S THESIS
Major: Computer Science
HA NOI - 2024
VIETNAM NATIONAL UNIVERSITY, HA NOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Nguyen Tien Dat
Apply Graph Neural Network for Driver Activity Recognition
from Multiple Cameras
MASTER’S THESIS
Major: Computer Science
Code: 8480101.01
Supervisor: Dr. Ta Viet Cuong
Declaration

I hereby declare that the work contained in this thesis is my own and has not been previously submitted for a degree or diploma at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no materials previously published or written by another person except where due reference or acknowledgement is made.
Signature
Acknowledgments

First and foremost, I would like to express my deepest gratitude to my mentor and advisor, Dr. Viet Cuong Ta. I am truly thankful to him for his dedicated guidance and support, not only in research but also in my professional and personal life. There were times when I thought I would give up on my master’s studies, not knowing which topic to pursue, but he always believed in me and motivated me, even though I have been delayed by a year. His encouragement and trust have been invaluable, pushing me to overcome the challenges. I sincerely thank him for all his help.

Next, I would like to express my heartfelt thanks to my family. They have always been a strong pillar of support and a great source of motivation throughout my academic journey, from my university days up until now, marking almost seven years of studying at the University of Technology. Their unconditional support has been the foundation for me to overcome all obstacles on my path to education and career.

I would also like to extend my gratitude to the Faculty of Information Technology and all the professors. Thanks to their support and the provision of the best tools and knowledge, I have been able to pursue and develop my career after graduation. The knowledge and experience I gained from the faculty have been invaluable assets for my future.

Finally, I want to thank myself. Thank you for not giving up and for seeing this journey through to the end, even if it took a little longer than expected. My persistence and determination have led me to complete this program, and I am proud of that.

This research was conducted with funding support from research project QG.23.32 of Vietnam National University, Hanoi.
Abstract

Recognizing the driver’s activity plays an important role in ensuring driving safety in practice. However, the usual approach, which relies on image video streams, employs large deep network models and does not perform as expected because of occlusion issues between different driver actions. In our approach, we propose combining an image video stream and a pose stream to tackle these challenges. Firstly, in the image stream, lightweight deep network models are applied to images sampled within a fixed time window. In addition, a pose detection model is used to extract the driver’s pose in the form of a dynamic graph. The extracted dynamic graph is then learned with a spatio-temporal graph convolutional network. Subsequently, the two streams are joined with a merging module to predict the actions. Our proposed model is tested on the AI City Challenge benchmark with different camera views. The results show that our model can improve accuracy by 1% to 3% depending on the view. Moreover, when the input image comes from the side views, which make the actions prone to occlusion, our model reduces errors by around 10-15%.
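To make the two-stream design summarized above more concrete, the following is a minimal sketch, assuming PyTorch; the class name, feature dimensions, and placeholder backbones are illustrative assumptions, not the exact models developed in Chapter 3.

```python
import torch
import torch.nn as nn

class TwoStreamDriverActionNet(nn.Module):
    """Hypothetical sketch of the two-stream idea: an image stream over
    frames sampled in a fixed time window and a pose stream over keypoint
    graphs, merged into a single representation for action classification."""

    def __init__(self, image_backbone: nn.Module, pose_backbone: nn.Module,
                 image_dim: int, pose_dim: int, num_classes: int):
        super().__init__()
        self.image_backbone = image_backbone  # e.g., a lightweight 2D/3D CNN
        self.pose_backbone = pose_backbone    # e.g., an ST-GCN over pose graphs
        # Merging module: concatenate both representations, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(image_dim + pose_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, frames: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # frames: (batch, channels, time, height, width), sampled in a fixed window
        # poses:  (batch, channels, time, num_keypoints), e.g., 2D joint coordinates
        # Each backbone is assumed to return a flat per-clip feature vector.
        img_feat = self.image_backbone(frames)   # -> (batch, image_dim)
        pose_feat = self.pose_backbone(poses)    # -> (batch, pose_dim)
        fused = torch.cat([img_feat, pose_feat], dim=1)
        return self.classifier(fused)            # action class logits
```

A concrete instantiation would plug the lightweight image model and the spatio-temporal graph convolutional network described in Chapter 3 into the two backbone slots.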
List of Figures
2.1 Example classes from the Kinetics dataset [18] 5
2.2 A chronological overview of recent representative work in video action recognition from 2014 to 2020 [21] 6
2.3 Combined CNN and LSTM architecture [22] 7
2.4 2D and 3D convolution operations [23] 8
2.5 3D CNN architecture [25] 9
2.6 Two-stream architecture [25] 10
2.7 A SlowFast network [8] 11
2.8 Temporal segment network [9] 12
2.9 The 17 keypoints used to represent the human body in skeleton-based action recognition. The left part of the image shows a person, the middle part lists the keypoints, and the right part shows their corresponding positions on the body [34] 15
2.10 The pipeline of ST-GCN for skeleton-based action recognition 18
3.1 The overview of our proposed two-stream architecture for combining the Image stream and the Pose stream 21
3.2 The overview of the Image Module, which contains a preprocessing step, a 2D convolution, and a 3D convolution operator for extracting features from the sequence of images 23
3.3 The overview of the Pose module, which employs a pose extractor to create the graph representations and uses ST-GCN to learn the spatial-temporal structure from the output graph 25
3.4 The overview of the Output module, which combines the Image representation and Pose representation into a single representation for learning the driver’s action class probability 28
4.1 Camera mounting setup for the three views 32
Table of Contents

3.1 Problem Statement 20
3.2 Overall Architecture 21
3.3.2 Feature Extraction Pipeline 24
3.3.3 Temporal Feature Learning 24
3.4 Pose Module 25
3.4.1 Graph-based Pose Representation 25
3.4.2 Spatial-Temporal Graph Construction 26
3.4.3 Feature Learning through Graph Convolution 26
3.4.4 Implementation Details and Challenges 27
3.5 Output Module 28
3.5.1 Feature Transformation 29
3.5.2 Feature Integration Strategies 29
3.5.3 Classification and Training 30
Chapter 4 Evaluation 31
4.1 Dataset 31
4.1.1 The NVIDIA AI City Challenge 2023 Dataset 31
4.1.2 Data Collection Setup 32
4.1.3 Dataset Content and Structure 33
4.1.4 Dataset Organization and Preprocessing 34
4.2 Experimental Setup 36
4.2.1 Hyperparameter Configuration 37
4.2.2 Training Strategy 38
4.2.3 Evaluation Metrics 38
4.3 Experiment Results 39
4.4 Ablation Study on Output Module 42
Chapter 5 Conclusion and Future Work 45