Socially aware robot navigation framework: Social activities recognition using deep learning techniques
Ngoc Anh Pham, Lan Anh Nguyen and Xuan Tung Truong
Faculty of Control Engineering, Le Quy Don Technical University
Hanoi, Vietnam
xuantung.truong@gmail.com
Abstract—In this study, we propose a deep learning based social activities recognition algorithm for socially aware mobile robot navigation frameworks. The proposed method utilizes the OpenPose library and a Long Short-Term Memory (LSTM) deep neural network, which observes the human skeleton over a number of time steps and then predicts the human social activities, including running, walking, standing, sitting and lying down. We train and test the proposed deep learning neural network model on a dataset that we synthesized ourselves. The experimental results illustrate that our proposed method can predict the human social activities with high accuracy.
Index Terms—Social activities recognition, OpenPose, LSTM,
Socially aware navigation, Mobile service robot
I. INTRODUCTION
In recent years, autonomous mobile robots have been increasingly researched, developed and applied in social life as well as in the military field. The strong development of the fourth scientific and technological revolution, together with the trend of globalization, has been a strong driving force behind manufacturing technology and the application of autonomous mobile robots in all areas of life.
Although today's modern robot navigation systems are capable of driving a mobile robot to avoid and approach humans in a socially acceptable manner, providing respectful and polite behaviors akin to those of humans [1], [2] and [3], they still suffer from the following drawbacks if we wish to deploy the robots in our daily life settings: (i) a robot should react according to the social cues and signals of humans (facial expression, voice pitch and tone, body language, human gestures), (ii) a robot should predict the future actions of the human [4], and (iii) a robot should be able to estimate the social activities of the humans in its vicinity.
Robots navigating in social environments are greatly affected by human navigation processes as well as the decisions of humans. Therefore, robots can make better decisions if they know in advance the plans that humans will make, allowing them to foresee human trajectories. Previous research on trajectory prediction faces several challenges due to the inherent properties of human motion in crowded scenes [5], such as interpersonal interaction, social acceptability and the multiplicity of plausible trajectories.
Traditional methods based on hand-crafted features [6] have exhaustively addressed the interpersonal aspect. Methods based on Recurrent Neural Networks (RNNs) [7] and [8] can effectively handle the socially acceptable aspect; these methods use Long Short-Term Memory (LSTM) networks to jointly reason across multiple agents and predict their trajectories in a scene. Besides, the problem of multiple trajectories has been studied in the context of route choices in a given scene [5]. Moreover, in [9] the authors demonstrate that pedestrians make different navigation choices in crowded scenes depending on their personal properties.
Nevertheless, in order to investigate human movement as well as to support the prediction of future human trajectories, predicting human social activities is a very important part, because it allows a mobile robot to automatically anticipate human situations and actively set up the corresponding action scenarios. Human social activities prediction has been studied and incorporated into robotic systems, with applications to the trajectory planning of robot arms [10] and [11], mobile robots [12], and autonomous driving [13]. The authors of [11] and [12] used Hidden Markov Models (HMMs) to model activities and recognize human intents. Besides, by employing radial basis function neural networks (RBFNN), the authors of [10] estimated the motion intention of the human. However, these systems suffer from limitations when the number of human action types increases or in crowded scenes.
Therefore, in this study, we propose an improved system to predict human social activities, including standing, sitting, lying down and walking, which uses the output of the OpenPose model and a deep Long Short-Term Memory (LSTM) network. We aim to apply the output of this system to the socially aware navigation system of mobile robots, helping them avoid humans effectively in crowded environments, because it enables the robot to understand humans' intentions and foresee their future trajectories.
The remainder of this paper is organized as follows. Section II introduces the background information that will be utilized in the paper. The human social activities prediction algorithm using deep learning techniques is presented in Section III. Section IV shows the experimental results. The conclusion is provided in Section V.
II. BACKGROUND INFORMATION
A. The Overview of the OpenPose Model
The OpenPose model is a real-time multi-person keypoint detection library for body, face, hand and foot estimation [14], [15]. The OpenPose model was created by the CMU Perceptual Computing Lab. It consists of a convolutional neural network with two branches, in which the first branch predicts the confidence (reliability) maps of the body parts and the second branch predicts the part affinity fields that encode the association between parts.
Fig. 1. The architecture of the OpenPose algorithm.
The inputs of the OpenPose model can be an image, a video, a webcam, a Flir/Point Grey camera or an IP camera. The outputs of the OpenPose model are the basic image and keypoint display/saving in popular image formats such as PNG, JPG, AVI, etc., or keypoint saving as JSON, XML, etc. The number of body keypoints that can be exported from the OpenPose model is 15, 18 or 25.

Fig. 2. The example result of the OpenPose algorithm.

In particular, the authors also provide an API (Application Programming Interface) for two popular languages, Python and C++, allowing users to easily use the OpenPose model in their applications.
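To illustrate, a minimal sketch of keypoint extraction through the Python API is given below. It assumes a standard pyopenpose build with a local model folder; the parameter names follow the public OpenPose examples rather than anything specified in this paper, and the input file name is hypothetical.

```python
import cv2
import pyopenpose as op  # Python wrapper built from the OpenPose sources

# Configure the wrapper; "model_folder" must point to the downloaded models.
params = {"model_folder": "models/", "model_pose": "BODY_25"}  # 25 keypoints
opWrapper = op.WrapperPython()
opWrapper.configure(params)
opWrapper.start()

# Run pose estimation on a single image.
datum = op.Datum()
datum.cvInputData = cv2.imread("person.jpg")  # hypothetical input file
opWrapper.emplaceAndPop(op.VectorDatum([datum]))

# datum.poseKeypoints has shape (num_people, 25, 3): x, y and confidence.
print(datum.poseKeypoints)
```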
B. The Overview of the LSTM Model
The Long Short-Term Memory (LSTM) network [16] is a recurrent neural network architecture capable of learning long-term dependencies over entire sequences of data, such as human speech or video sequences.

A common LSTM unit is composed of a cell state, an input gate, an output gate and a forget gate. The cell state is used to remember values over arbitrary time intervals, and the three gates are used to regulate the flow of information into and out of the cell.
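For reference, the standard update equations of an LSTM unit [16] make this gating explicit, with $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication:

\begin{align*}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{align*}

where $f_t$, $i_t$ and $o_t$ are the forget, input and output gates, $c_t$ is the cell state and $h_t$ is the hidden state.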
Fig. 3. The architecture of the LSTM model.
The LSTM model is well-suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series. The LSTM model was developed to deal with the vanishing gradient problem that can be encountered when training traditional RNNs. Its relative insensitivity to gap length is an advantage of the LSTM model over plain RNNs, hidden Markov models and other sequence learning methods in numerous applications.
III. PROPOSED METHOD
In this study, we divide the human social activities recognized for socially aware mobile robot navigation systems into five categories, as illustrated in Fig. 4: standing, sitting, lying down, walking and running.
There have been many studies proving that a person's posture carries a lot of information, including emotions and health conditions [17]. Does a person's posture also contain information about their social activities? To answer this question, we utilize the LSTM network, observe the person's posture over n_steps time steps, and then recognize the social activities of the humans. The block diagram of the proposed system is shown in Fig. 5. The proposed system consists of two phases: training and testing.
The skeleton of each person is extracted by the OpenPose algorithm. A skeleton consists of the 2D coordinates of $j$ keypoints on the body (15, 18 or 25 keypoints); therefore, for each time step we have a coordinate vector $X = [x_1, x_2, \dots, x_k]$ with $x_i \in \mathbb{R}$ and $k = 2j$. As a result, the input of the LSTM network is an $n_{steps} \times k$ matrix, while the output is one of the $n_{case}$ human social activity classes (as noted above, $n_{case} = 5$), represented as a one-hot vector.
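As a minimal sketch (NumPy only; the helper names are ours, not from the paper), one frame's keypoints can be flattened into the coordinate vector and a label turned into a one-hot vector as follows:

```python
import numpy as np

ACTIVITIES = ["standing", "sitting", "lying down", "walking", "running"]  # n_case = 5

def frame_to_vector(keypoints_xy):
    """Flatten j keypoints of shape (j, 2) into [x1, y1, ..., xj, yj], k = 2*j."""
    return np.asarray(keypoints_xy, dtype=np.float32).reshape(-1)

def label_to_onehot(activity):
    """Encode an activity name as a one-hot vector of length n_case."""
    y = np.zeros(len(ACTIVITIES), dtype=np.float32)
    y[ACTIVITIES.index(activity)] = 1.0
    return y

# Example: 25 keypoints -> k = 50; stacking n_steps such vectors row-wise
# yields the n_steps x k matrix fed to the LSTM.
x = frame_to_vector(np.random.rand(25, 2))  # shape (50,)
y = label_to_onehot("walking")              # shape (5,)
```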
Fig. 4. The experimental scenario of human-robot interaction: (a) a standing human, (b) a sitting human, (c) a lying down human, (d) a walking human and (e) a running human.
Fig. 5. The block diagram of the social activities recognition algorithm.
A. Data Preparation
Dataset preparation is one of the most important steps in the training process of deep learning models. It is crucial and can significantly affect the overall performance, accuracy and usability of the trained model. A dataset of human social activities was not available, so we created our own dataset by recording multiple videos under different environmental conditions.
To create the time-step dataset, we utilize the sliding window technique, as shown in Fig. 6. A window of width n_steps slides across the data series; at each step we obtain a data point and a corresponding label. The label is the name of the social activity: standing, sitting, lying down, walking or running.
Fig. 6. The sliding window.
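A minimal sketch of this slicing, assuming the per-frame vectors of one recording are stacked row-wise (a stride of one frame and labeling each window by its last frame are our assumptions; the paper does not state them):

```python
import numpy as np

def sliding_windows(frames, labels, n_steps=32, stride=1):
    """Cut a (T, k) series of per-frame vectors into (n_steps, k) windows,
    pairing each window with the label of its last frame."""
    X, Y = [], []
    for start in range(0, len(frames) - n_steps + 1, stride):
        X.append(frames[start:start + n_steps])
        Y.append(labels[start + n_steps - 1])
    return np.stack(X), np.stack(Y)

# Example: 300 frames of 50-D vectors yield X of shape (269, 32, 50).
frames = np.random.rand(300, 50).astype(np.float32)
labels = np.tile(np.eye(5, dtype=np.float32)[3], (300, 1))  # all "walking"
X, Y = sliding_windows(frames, labels)
```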
TABLE I
THE SET OF PARAMETERS

Parameters           Value     Parameters           Value
N steps              32        Decay rate           0.96
Hidden layer         48        Decay steps          6000
Classes              5         Epochs               300
Learning rate        0.0025    Batch size           512
Init learning rate   0.005     Lambda loss amount   0.0015
The values of the keypoints in each window are written to the input set X, while the ground truth, represented by a classification label, is written to the output set Y. We do the same for the training and testing sets.
B. Training Process
The dataset is split into two sets: 80 percent for training and 20 percent for testing. It is extremely important that the training set and the testing set are independent of each other and do not overlap. The values of the parameters are set empirically, as listed in Table I.

Fig. 7. The training and testing accuracy.
The batch size and the number of epochs were set to different values for the training process. The training process ran automatically and finished when the preset number of epochs was reached. The model was saved after every certain number of epochs. At the end of the training process, we exported the prediction results on the training set and the testing set to evaluate the newly trained model.
We filmed a variety of subjects with different heights, weights and BMI (Body Mass Index) values to create our datasets. The number of keypoints j is chosen as 25 and the time step n_steps is chosen as 32, so the input of the network is a 32x50 matrix.
We tested different sizes of the LSTM hidden layer to find a good parameter set. The training results for a good configuration are shown in Fig. 7. In this case, we used a hidden layer of 48 units and trained for 300 epochs.
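The paper does not name a training framework; a minimal Keras-style sketch wired up with the Table I hyperparameters might look as follows (treating the "Lambda loss amount" as an L2 penalty and using the Adam optimizer are our assumptions):

```python
import tensorflow as tf

N_STEPS, K, N_CLASSES = 32, 50, 5      # 25 keypoints x 2 coordinates
L2 = tf.keras.regularizers.l2(0.0015)  # "Lambda loss amount" from Table I

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(48, input_shape=(N_STEPS, K),
                         kernel_regularizer=L2),             # 48 hidden units
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),  # one-hot output
])

# Exponentially decaying learning rate, per Table I.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.005, decay_steps=6000, decay_rate=0.96)
model.compile(optimizer=tf.keras.optimizers.Adam(schedule),
              loss="categorical_crossentropy", metrics=["accuracy"])

# model.fit(X_train, Y_train, epochs=300, batch_size=512,
#           validation_data=(X_test, Y_test))
```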
The results of evaluating the testing set with a confusion matrix are shown in Fig. 8. The evaluation results on the testing set are very good: the accuracy of the proposed model is over 95 percent.
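For completeness, such an evaluation can be produced in a few lines with scikit-learn, reusing `model` and the one-hot test labels from the sketches above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.argmax(Y_test, axis=1)                # one-hot -> class indices
y_pred = np.argmax(model.predict(X_test), axis=1)

print("accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))           # rows: true, cols: predicted
```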
IV. EXPERIMENTAL RESULTS
A. Experimental Setup
In order to collect the data for training and testing the proposed model, we utilize a smartphone to represent the position of the mobile robot; its camera records at full-HD (1920x1080) resolution. However, in order to increase the frame rate of the prediction process, the videos are scaled to 640x480 resolution before being fed to the proposed model. The human stands 8-10 m away from the camera and performs the human social activities, including standing, sitting, lying down, walking and running, as shown in Fig. 4. We also record a video that contains a combination of several human social activities to evaluate the accuracy of the proposed algorithm.
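A short sketch of this preprocessing step, using OpenCV to rescale each full-HD frame to 640x480 before pose estimation (the file name is hypothetical):

```python
import cv2

cap = cv2.VideoCapture("activity_1080p.mp4")  # hypothetical recording
while True:
    ok, frame = cap.read()
    if not ok:
        break
    small = cv2.resize(frame, (640, 480))  # lighter input raises the frame rate
    # ... feed `small` to OpenPose and then the LSTM classifier ...
cap.release()
```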
Fig. 8. The confusion matrix.
The testing and training processes were run on a desktop computer with an Intel Core i7-10700 CPU, 16 GB of RAM and an NVIDIA GeForce GTX 1650 card, running the Ubuntu 18.04 operating system.
B. Experimental Results
The experimental results are shown in Fig. 9. A video of our experiments can be found at the link1.
The proposed LSTM network model predicts very well for clear-cut human social activities such as standing, sitting and lying down, as illustrated in Fig. 9(a), 9(b) and 9(c). In the more difficult cases, for example when the human runs or walks, as shown in Fig. 9(d) and 9(e), the output of the proposed model is still quite good. In these cases, when the human changes moving direction, the proposed network model may confuse running with walking in the early frames.
In addition, we conduct an experiment that combines the five single social activities. Although the movement is complicated, the LSTM network model shows good and stable results. Based on the achieved results, we are going to incorporate this information into socially aware navigation systems. It enables a robot to perceive humans' intentions, leading to predictions of their future trajectories. Therefore, the robot is able to avoid humans more proactively and efficiently.
V. CONCLUSIONS
In this article, we have presented an approach that recognizes the social activities of humans for socially aware mobile robot navigation systems using deep learning techniques. We make use of the OpenPose model to extract the human posture and an LSTM network to observe a person over a certain period of time. We then distinguish the social activities of the human in front of the mobile robot.
1 https://youtu.be/WM5OJJ3icIA
Fig. 9. Examples of experimental results: (a) a standing human, (b) a sitting human, (c) a lying down human, (d) a walking human and (e) a running human.
This approach initially gave some very positive results.
In the future, we will continue to develop the algorithm by pre-processing the information from the image and applying the algorithm to scenes with multiple people. In addition, we will incorporate the recognized social activities into the socially aware mobile robot navigation system to evaluate their usefulness.
REFERENCES
[1] M. Shiomi, F. Zanlungo, K. Hayashi, and T. Kanda, "Towards a socially acceptable collision avoidance for a mobile robot navigating among pedestrians using a pedestrian model," International Journal of Social Robotics, vol. 6, no. 3, pp. 443-455, 2014.
[2] X. T. Truong and T. D. Ngo, "Toward socially aware robot navigation in dynamic and crowded environments: A proactive social motion model," IEEE Transactions on Automation Science and Engineering, pp. 1743-1760, 2017.
[3] Y. F. Chen, M. Everett, M. Liu, and J. P. How, "Socially aware motion planning with deep reinforcement learning," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[4] X. T. Truong and T. D. Ngo, "Toward socially aware robot navigation in dynamic and crowded environments: A proactive social motion model," in ICRA 2019 Workshop on MoRobAE - Mobile Robot Assistants for the Elderly, Montreal, Canada, 2019, pp. 20-24.
[5] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, "Social GAN: Socially acceptable trajectories with generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[6] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg, "Who are you with and where are you going?" in CVPR 2011, 2011, pp. 1345-1352.
[7] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: Human trajectory prediction in crowded spaces," in IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 961-971.
[8] F. Bartoli, G. Lisanti, L. Ballan, and A. Del Bimbo, "Context-aware trajectory prediction," in 2018 24th International Conference on Pattern Recognition (ICPR), 2018, pp. 1941-1946.
[9] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, "Learning social etiquette: Human trajectory understanding in crowded scenes," in Computer Vision - ECCV 2016. Springer International Publishing, 2016, pp. 549-565.
[10] Y. Li and S. S. Ge, "Human-robot collaboration based on motion intention estimation," IEEE/ASME Transactions on Mechatronics, vol. 19, no. 3, pp. 1007-1014, 2013.
[11] J. S. Park, C. Park, and D. Manocha, "I-Planner: Intention-aware motion planning using learning-based human motion prediction," The International Journal of Robotics Research, vol. 38, no. 1, pp. 23-39, 2019.
[12] R. Kelley, A. Tavakkoli, C. King, M. Nicolescu, M. Nicolescu, and G. Bebis, "Understanding human intentions via hidden Markov models in autonomous mobile robots," in Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction, 2008, pp. 367-374.
[13] T. Bandyopadhyay, K. S. Won, E. Frazzoli, D. Hsu, W. S. Lee, and D. Rus, "Intention-aware motion planning," in Algorithmic Foundations of Robotics X. Springer, 2013, pp. 475-491.
[14] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291-7299.
[15] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172-186, 2019.
[16] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[17] V. Narayanan, B. M. Manoghar, V. S. Dorbala, D. Manocha, and A. Bera, "ClearPath: Highly parallel collision avoidance for multiagent simulation," arXiv preprint arXiv:2003.01062, 2020.