
DOCUMENT INFORMATION

Basic information

Title Apply Graph Neural Network for Driver Activity Recognition from Multiple Cameras
Author Nguyen Tien Dat
Supervisor Dr. Ta Viet Cuong
School Vietnam National University, Ha Noi, University of Engineering and Technology
Major Computer Science
Type Thesis
Year of publication 2024
City Ha Noi
Format
Pages 75
File size 279.25 KB


VIETNAM NATIONAL UNIVERSITY, HA NOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyen Tien Dat

Apply Graph Neural Network for Driver Activity Recognition

from Multiple Cameras

MASTER’S THESIS

Major: Computer Science

HA NOI - 2024

Code: 8480101.01

Supervisor: Dr. Ta Viet Cuong


I hereby declare that the work contained in this thesis is my own and has not been previously submitted for a degree or diploma at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no materials previously published or written by another person except where due reference or acknowledgement is made.

Signature


First and foremost, I would like to express my deepest gratitude to my mentor and advisor, Dr. Viet Cuong Ta. I am truly thankful to him for his dedicated guidance and support, not only in research but also in both my professional and personal life. There were times when I thought I would give up on my master’s studies, not knowing which topic to pursue, but he always believed in me and motivated me, even though I have been delayed by a year. His encouragement and trust have been invaluable, pushing me to overcome the challenges. I sincerely thank him for all his help.

Next, I would like to express my heartfelt thanks to my family. They have always been a strong pillar of support and a great source of motivation throughout my academic journey, from my university days up until now, marking almost seven years of studying at the University of Technology. Their unconditional support has been the foundation for me to overcome all obstacles on my path to education and career.

I would also like to extend my gratitude to the Faculty of Information Technology and all the professors. Thanks to their support and the provision of the best tools and knowledge, I have been able to pursue and develop my career after graduation. The knowledge and experience I gained from the faculty have been invaluable assets for my future.

Finally, I want to thank myself. Thank you for not giving up and for seeing this journey through to the end, even if it took a little longer than expected. My persistence and determination have led me to complete this program, and I am proud of that.

This research was conducted with funding support from the research project QG.23.32 of Vietnam National University, Hanoi.


3.4 The overview of the Output module which combines the Image representation and Pose representation into a single representation for learning the driver’s action class probability 28

4.1 Camera mounting setup for the three views 32


Recognizing the driver’s activity plays an important role in ensuring driving safety in practice. However, the usual approach, which relies on image video streams, employs large deep network models and does not work as expected because of occlusion issues between different driver actions. In our approach, we propose combining an image video stream with a pose stream to tackle these challenges. Firstly, in the image stream, lightweight deep network models are applied to images sampled within a fixed time window. In addition, a pose detection model is used to extract the driver’s pose in the form of a dynamic graph. The extracted dynamic graph is then learned with a spatio-temporal graph convolution network. Subsequently, the two streams are joined by a merging module to predict the actions. Our proposed model is tested on the AI City Challenge benchmark with different camera views. The results show that our model improves accuracy by 1% to 3%, depending on the view. Moreover, when the input images come from the side views, which make the actions prone to occlusion, our model reduces errors by around 10-15%.
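To make the two-stream idea concrete, the following is a minimal sketch in plain Python, not the thesis implementation. It assumes a toy 5-joint skeleton (the thesis figures refer to 17 keypoints), replaces the temporal convolution of ST-GCN with simple averaging over frames, and assumes that the merging module concatenates the two feature vectors before a linear classifier. All names (`EDGES`, `graph_conv`, `pose_embedding`, `fuse_and_classify`), dimensions, and random weights are hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical toy skeleton: 5 joints, with joint 1 connected to all others.
NUM_JOINTS = 5
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]


def normalized_adjacency(num_joints, edges):
    """Return D^-1/2 (A + I) D^-1/2 as a nested list (self-loops included)."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def matmul(x, y):
    """Multiply two matrices stored as nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]


def graph_conv(adj, feats, weight):
    """One spatial graph convolution: ReLU(adj @ feats @ weight)."""
    out = matmul(matmul(adj, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def pose_embedding(frames, adj, weight):
    """Graph convolution per frame, then mean over joints and frames.

    Assumption: averaging stands in for the temporal convolution of ST-GCN.
    """
    dim = len(weight[0])
    total = [0.0] * dim
    for frame in frames:
        for joint in graph_conv(adj, frame, weight):
            for k in range(dim):
                total[k] += joint[k]
    count = len(frames) * len(adj)
    return [t / count for t in total]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def fuse_and_classify(image_feat, pose_feat, cls_weight):
    """Assumed fusion: concatenate both features, apply a linear layer."""
    fused = image_feat + pose_feat  # list concatenation
    return softmax(matmul([fused], cls_weight)[0])


rng = random.Random(0)
adj = normalized_adjacency(NUM_JOINTS, EDGES)
# 8 frames, each with (x, y) coordinates for every joint.
frames = [[[rng.uniform(-1, 1) for _ in range(2)] for _ in range(NUM_JOINTS)]
          for _ in range(8)]
w_pose = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
# Stand-in for the image-stream output; a real model would compute this.
image_feat = [rng.uniform(-1, 1) for _ in range(6)]
w_cls = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(10)]

pose_feat = pose_embedding(frames, adj, w_pose)
probs = fuse_and_classify(image_feat, pose_feat, w_cls)
print(probs)  # three class probabilities that sum to 1
```

The sketch only shows how a skeleton graph and an image feature can end up in one class distribution. In a trained model the weights would be learned jointly, and the fusion step could equally be a sum or a weighted combination rather than concatenation.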


List of Figures

2.1 Example classes from the Kinetics dataset [18] 5

2.2 A chronological overview of recent representative work in video action recognition from 2014 to 2020 [21] 6

2.3 CNN and LSTM combine architecture [22] 7

2.4 2D and 3D convolution operations [23] 8

2.5 3D CNN architecture [25] 9

2.6 Two-stream architecture [25] 10

2.7 A SlowFast network [8] 11

2.8 Temporal segment network [9] 12

2.9 The 17 keypoints used to represent the human body in skeleton-based action recognition. The left part of the image shows a person, the middle part lists the keypoints, and the right part shows their corresponding positions on the body [34] 15

2.10 The pipeline of ST-GCN for skeleton-based action recognition 18

3.1 The overview of our proposed two-stream architecture for combination between the Image stream and the Pose stream 21

3.2 The overview of the Image Module which contains a preprocessing step, a 2D convolution, and 3D convolution operator for extracting features from the sequence of images 23

3.3 The overview of the Pose module which employs a pose extractor to create the graph representations and uses ST-GCN to learn the spatial-temporal structure from the output graph 25


3.3.2 Feature Extraction Pipeline 24

3.3.3 Temporal Feature Learning 24

3.4 Pose Module 25

3.4.1 Graph-based Pose Representation 25

3.4.2 Spatial-Temporal Graph Construction 26

3.4.3 Feature Learning through Graph Convolution 26

3.4.4 Implementation Details and Challenges 27

3.5 Output Module 28

3.5.1 Feature Transformation 29

3.5.2 Feature Integration Strategies 29

3.5.3 Classification and Training 30

Chapter 4 Evaluation 31

4.1 Dataset 31

4.1.1 The NVIDIA AI City Challenge 2023 Dataset 31

4.1.2 Data Collection Setup 32

4.1.3 Dataset Content and Structure 33

4.1.4 Dataset Organization and Preprocessing 34

4.2 Experimental Setup 36

4.2.1 Hyperparameter Configuration 37

4.2.2 Training Strategy 38

4.2.3 Evaluation Metrics 38

4.3 Experiment Results 39

4.4 Ablation Study on Output Module 42

Chapter 5 Conclusion and Future Work 45


3.1 Problem Statement 20

3.2 Overall Architecture 21


Date posted: 03/08/2025, 08:46
