Socially aware robot navigation framework: Social activities recognition using deep learning techniques
Ngoc Anh Pham, Lan Anh Nguyen and Xuan Tung Truong
Faculty of Control Engineering, Le Quy Don Technical University
Hanoi, Vietnam
xuantung.truong@gmail.com
Abstract—In this study, we propose a deep learning based social activities recognition algorithm for socially aware mobile robot navigation frameworks. The proposed method utilizes the OpenPose library and a Long Short-Term Memory (LSTM) deep neural network, which observes the human skeleton over a number of time steps and then predicts the human social activities, including running, walking, standing, sitting and lying down. We train and test the proposed deep learning neural network model on a dataset that we synthesized ourselves. The experimental results illustrate that our proposed method can predict the human social activities with high accuracy.
Index Terms—Social activities recognition, OpenPose, LSTM,
Socially aware navigation, Mobile service robot
I. INTRODUCTION
In recent years, autonomous mobile robots have been increasingly researched, developed and applied in social life as well as in the military field. The strong development of the fourth scientific and technological revolution, together with the trend of globalization, has been a strong driving force behind manufacturing technology and the application of autonomous mobile robots in all areas of life.
Although today's modern robot navigation systems are capable of driving a mobile robot to avoid and approach humans in a socially acceptable manner, providing respectful and polite behaviors akin to those of humans [1], [2] and [3], they still suffer from the following drawbacks if we wish to deploy the robots in our daily life settings: (i) a robot should react according to the social cues and signals of humans (facial expression, voice pitch and tone, body language, human gestures), (ii) a robot should predict the future actions of the human [4], and (iii) a robot should be able to estimate the social activities of the humans in its vicinity.
Robots navigating in social environments are greatly affected by human navigation processes as well as the decisions of humans. Therefore, robots can make better decisions if they know in advance the plans that humans will make, allowing them to foresee human trajectories. Previous research on trajectory prediction faces several challenges due to the inherent properties of human motion in crowded scenes [5], such as interpersonal interaction, social acceptability and the multiplicity of plausible trajectories.
Traditional methods based on hand-crafted features [6] have exhaustively addressed the interpersonal aspect. Methods based on Recurrent Neural Networks (RNNs) [7] and [8] can effectively handle the socially acceptable aspect; these methods use Long Short-Term Memory (LSTM) networks to jointly reason across multiple agents and predict their trajectories in a scene. Besides, the problem of multiple trajectories has been studied in the context of route choices in a given scene [5]. Moreover, in [9] the authors demonstrate that pedestrians make different navigation choices in crowded scenes depending on their personal properties.
Nevertheless, in order to investigate human movement as well as to support the prediction of future human trajectories, predicting human social activities is a very important part, because it allows a mobile robot to automatically anticipate human situations and actively set up the corresponding action scenarios. Human social activities prediction has been studied and incorporated into robotic systems, with applications to the trajectory planning of robot arms [10] and [11], mobile robots [12], and autonomous driving [13]. The authors of [11] and [12] used Hidden Markov Models (HMMs) to model activities and recognize human intents. Besides, by employing radial basis function neural networks (RBFNN), the authors of [10] estimated the motion intention of the human. However, these systems suffer from limitations when the number of human action types increases or in crowded scenes.
Therefore, in this study, we propose an improved system to predict human social activities, including standing, sitting, lying down and walking, which uses the output of the OpenPose model and a deep Long Short-Term Memory (LSTM) network. We aim to apply the output of this system to the socially aware navigation system of mobile robots, helping them avoid humans effectively in crowded environments, because it enables the robot to understand humans' intentions and foresee their future trajectories.
The remainder of this paper is organized as follows. Section II introduces the background information that will be utilized in the paper. The human social activities prediction algorithm using deep learning techniques is presented in Section III. Section IV shows the experimental results. The conclusion is provided in Section V.
II. BACKGROUND INFORMATION
A. The Overview of the OpenPose Model
The OpenPose model is a real-time multi-person keypoint detection library for body, face, hand and foot estimation [14], [15]. The OpenPose model was created by the CMU Perceptual Computing Lab. It consists of a convolutional neural network with two branches, in which the first branch predicts the confidence (reliability) maps of the body parts and the second branch predicts the part affinity fields that encode the association between parts.
Fig. 1. The architecture of the OpenPose algorithm.
The inputs of the OpenPose model can be an image, a video, a webcam, a Flir/Point Grey camera or an IP camera. The outputs of the OpenPose model are the basic image and keypoint display/saving in popular image formats such as PNG, JPG, AVI, etc., or keypoint saving as JSON, XML, etc. The number of body keypoints that can be exported from the OpenPose model is 15, 18 or 25.

Fig. 2. The example result of the OpenPose algorithm.

In particular, the authors also provide an API (Application Programming Interface) for two popular languages, Python and C++, allowing users to easily use the OpenPose model in their applications.
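To illustrate, a minimal sketch of keypoint extraction through the Python API is given below. It assumes a standard pyopenpose build with a local model folder; the parameter names follow the public OpenPose examples rather than anything specified in this paper, and the input file name is hypothetical.

```python
import cv2
import pyopenpose as op  # Python wrapper built from the OpenPose sources

# Configure the wrapper; "model_folder" must point to the downloaded models.
params = {"model_folder": "models/", "model_pose": "BODY_25"}  # 25 keypoints
opWrapper = op.WrapperPython()
opWrapper.configure(params)
opWrapper.start()

# Run pose estimation on a single image.
datum = op.Datum()
datum.cvInputData = cv2.imread("person.jpg")  # hypothetical input file
opWrapper.emplaceAndPop(op.VectorDatum([datum]))

# datum.poseKeypoints has shape (num_people, 25, 3): x, y and confidence.
print(datum.poseKeypoints)
```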
B. The Overview of the LSTM Model
The Long Short-Term Memory (LSTM) network [16] is a recurrent neural network architecture capable of learning long-term dependencies over entire sequences of data, such as human speech or video sequences.

A common LSTM unit is composed of a cell state, an input gate, an output gate and a forget gate. The cell state is used to remember values over arbitrary time intervals, and the three gates are used to regulate the flow of information into and out of the cell.
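For reference, the standard update equations of an LSTM unit [16] make this gating explicit, with $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication:

\begin{align*}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{align*}

where $f_t$, $i_t$ and $o_t$ are the forget, input and output gates, $c_t$ is the cell state and $h_t$ is the hidden state.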
Fig. 3. The architecture of the LSTM model.
The LSTM model is well-suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series. The LSTM model was developed to deal with the vanishing gradient problem that can be encountered when training traditional RNNs. Its relative insensitivity to gap length is an advantage of the LSTM model over plain RNNs, hidden Markov models and other sequence learning methods in numerous applications.
III. PROPOSED METHOD
In this study, we divide the human social activities recognized for socially aware mobile robot navigation systems into five categories, as illustrated in Fig. 4: standing, sitting, lying down, walking and running.
There have been many studies proving that a person's posture carries a lot of information, including emotions and health conditions [17]. Does a person's posture also contain information about their social activities? To answer this question, we utilize the LSTM network, observe the person's posture over n_steps time steps, and then recognize the social activities of the humans. The block diagram of the proposed system is shown in Fig. 5. The proposed system consists of two phases: training and testing.
The skeleton of each person is extracted by the OpenPose algorithm. A skeleton consists of the 2D coordinates of $j$ keypoints on the body (15, 18 or 25 keypoints); therefore, for each time step we have a coordinate vector $X = [x_1, x_2, \dots, x_k]$ with $x_i \in \mathbb{R}$ and $k = 2j$. As a result, the input of the LSTM network is an $n_{steps} \times k$ matrix, while the output is one of the $n_{case}$ human social activity classes (as noted above, $n_{case} = 5$), represented as a one-hot vector.
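As a minimal sketch (NumPy only; the helper names are ours, not from the paper), one frame's keypoints can be flattened into the coordinate vector and a label turned into a one-hot vector as follows:

```python
import numpy as np

ACTIVITIES = ["standing", "sitting", "lying down", "walking", "running"]  # n_case = 5

def frame_to_vector(keypoints_xy):
    """Flatten j keypoints of shape (j, 2) into [x1, y1, ..., xj, yj], k = 2*j."""
    return np.asarray(keypoints_xy, dtype=np.float32).reshape(-1)

def label_to_onehot(activity):
    """Encode an activity name as a one-hot vector of length n_case."""
    y = np.zeros(len(ACTIVITIES), dtype=np.float32)
    y[ACTIVITIES.index(activity)] = 1.0
    return y

# Example: 25 keypoints -> k = 50; stacking n_steps such vectors row-wise
# yields the n_steps x k matrix fed to the LSTM.
x = frame_to_vector(np.random.rand(25, 2))  # shape (50,)
y = label_to_onehot("walking")              # shape (5,)
```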
Fig. 4. The experimental scenario of human-robot interaction: (a) a standing human, (b) a sitting human, (c) a lying down human, (d) a walking human and (e) a running human.
Fig. 5. The block diagram of the social activities recognition algorithm.
A. Data Preparation
Dataset preparation is one of the most important steps in the training process of deep learning models. It is crucial and can significantly affect the overall performance, accuracy and usability of the trained model. A dataset of human social activities was not available, so we created our own dataset by recording multiple videos under different environmental conditions.
To create the time-step dataset, we utilize the sliding window technique, as shown in Fig. 6. A window of width n_steps slides across the data series; at each step we obtain a data point and a corresponding label. The label is the name of the social activity: standing, sitting, lying down, walking or running.
Fig. 6. The sliding window.
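A minimal sketch of this slicing, assuming the per-frame vectors of one recording are stacked row-wise (a stride of one frame and labeling each window by its last frame are our assumptions; the paper does not state them):

```python
import numpy as np

def sliding_windows(frames, labels, n_steps=32, stride=1):
    """Cut a (T, k) series of per-frame vectors into (n_steps, k) windows,
    pairing each window with the label of its last frame."""
    X, Y = [], []
    for start in range(0, len(frames) - n_steps + 1, stride):
        X.append(frames[start:start + n_steps])
        Y.append(labels[start + n_steps - 1])
    return np.stack(X), np.stack(Y)

# Example: 300 frames of 50-D vectors yield X of shape (269, 32, 50).
frames = np.random.rand(300, 50).astype(np.float32)
labels = np.tile(np.eye(5, dtype=np.float32)[3], (300, 1))  # all "walking"
X, Y = sliding_windows(frames, labels)
```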
TABLE I
THE SET OF PARAMETERS

Parameters           Value     Parameters           Value
N steps              32        Decay rate           0.96
Hidden layer         48        Decay steps          6000
Classes              5         Epochs               300
Learning rate        0.0025    Batch size           512
Init learning rate   0.005     Lambda loss amount   0.0015
The values of the keypoints in each window are written to the input set X, while the ground truth, represented by a classification label, is written to the output set Y. We do the same for the training and testing sets.
B. Training Process
The dataset is split into two sets: 80 percent for training and 20 percent for testing. It is extremely important that the training set and the testing set are independent of each other and do not overlap. The values of the parameters are set empirically, as listed in Table I.

Fig. 7. The training and testing accuracy.
The batch size and the number of epochs were set to different values for the training process. The training process ran automatically and finished when the preset number of epochs was reached. The model was saved after every certain number of epochs. At the end of the training process, we exported the prediction results on the training set and the testing set to evaluate the newly trained model.
We filmed a variety of subjects with different heights, weights and BMI (Body Mass Index) values to create our datasets. The number of keypoints j is chosen as 25 and the time step n_steps is chosen as 32, so the input of the network is a 32x50 matrix.
We tested different sizes of the LSTM hidden layer to find a good parameter set. The training results for a good configuration are shown in Fig. 7. In this case, we used a hidden layer of 48 units and trained for 300 epochs.
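The paper does not name a training framework; a minimal Keras-style sketch wired up with the Table I hyperparameters might look as follows (treating the "Lambda loss amount" as an L2 penalty and using the Adam optimizer are our assumptions):

```python
import tensorflow as tf

N_STEPS, K, N_CLASSES = 32, 50, 5      # 25 keypoints x 2 coordinates
L2 = tf.keras.regularizers.l2(0.0015)  # "Lambda loss amount" from Table I

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(48, input_shape=(N_STEPS, K),
                         kernel_regularizer=L2),             # 48 hidden units
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),  # one-hot output
])

# Exponentially decaying learning rate, per Table I.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.005, decay_steps=6000, decay_rate=0.96)
model.compile(optimizer=tf.keras.optimizers.Adam(schedule),
              loss="categorical_crossentropy", metrics=["accuracy"])

# model.fit(X_train, Y_train, epochs=300, batch_size=512,
#           validation_data=(X_test, Y_test))
```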
The results of evaluating the testing set with a confusion matrix are shown in Fig. 8. The evaluation results on the testing set are very good: the accuracy of the proposed model is over 95 percent.
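For completeness, such an evaluation can be produced in a few lines with scikit-learn, reusing `model` and the one-hot test labels from the sketches above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.argmax(Y_test, axis=1)                # one-hot -> class indices
y_pred = np.argmax(model.predict(X_test), axis=1)

print("accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))           # rows: true, cols: predicted
```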
IV. EXPERIMENTAL RESULTS
A. Experimental Setup
In order to collect the data for training and testing the proposed model, we utilize a smartphone to represent the position of the mobile robot; its camera records at full-HD (1920x1080) resolution. However, in order to increase the frame rate of the prediction process, the videos are scaled to 640x480 resolution before being fed to the proposed model. The human stands 8-10 m away from the camera and performs the human social activities, including standing, sitting, lying down, walking and running, as shown in Fig. 4. We also record a video that contains a combination of several human social activities to evaluate the accuracy of the proposed algorithm.
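A short sketch of this preprocessing step, using OpenCV to rescale each full-HD frame to 640x480 before pose estimation (the file name is hypothetical):

```python
import cv2

cap = cv2.VideoCapture("activity_1080p.mp4")  # hypothetical recording
while True:
    ok, frame = cap.read()
    if not ok:
        break
    small = cv2.resize(frame, (640, 480))  # lighter input raises the frame rate
    # ... feed `small` to OpenPose and then the LSTM classifier ...
cap.release()
```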
Fig. 8. The confusion matrix.
The testing and training processes were run on a desktop computer with an Intel Core i7-10700 CPU, 16 GB of RAM and an NVIDIA GeForce GTX 1650 card, running the Ubuntu 18.04 operating system.
B. Experimental Results
The experimental results are shown in Fig. 9. A video of our experiments can be found at the link1.
The proposed LSTM network model predicts very well for clear-cut human social activities such as standing, sitting and lying down, as illustrated in Fig. 9(a), 9(b) and 9(c). In the more difficult cases, for example when the human runs or walks, as shown in Fig. 9(d) and 9(e), the output of the proposed model is still quite good. In these cases, when the human changes moving direction, the proposed network model may confuse running with walking in the early frames.
In addition, we conduct an experiment that combines the five single social activities. Although the movement is complicated, the LSTM network model shows good and stable results. Based on the achieved results, we are going to incorporate this information into socially aware navigation systems. It enables a robot to perceive humans' intentions, leading to predictions of their future trajectories. Therefore, the robot is able to avoid humans more proactively and efficiently.
V. CONCLUSIONS
In this article, we have presented an approach that recognizes the social activities of humans for socially aware mobile robot navigation systems using deep learning techniques. We make use of the OpenPose model to extract the human posture and an LSTM network to observe a person over a certain period of time. We then distinguish the social activities of the human in front of the mobile robot.
1 https://youtu.be/WM5OJJ3icIA
Fig. 9. Examples of experimental results: (a) a standing human, (b) a sitting human, (c) a lying down human, (d) a walking human and (e) a running human.
This approach initially gave some very positive results.
In the future, we will continue to develop the algorithm by pre-processing the information from the image and applying the algorithm to scenes with multiple people. In addition, we will incorporate the recognized social activities into the socially aware mobile robot navigation system to evaluate their usefulness.
REFERENCES
[1] M. Shiomi, F. Zanlungo, K. Hayashi, and T. Kanda, "Towards a socially acceptable collision avoidance for a mobile robot navigating among pedestrians using a pedestrian model," International Journal of Social Robotics, vol. 6, no. 3, pp. 443-455, 2014.
[2] X. T. Truong and T. D. Ngo, "Toward socially aware robot navigation in dynamic and crowded environments: A proactive social motion model," IEEE Transactions on Automation Science and Engineering, pp. 1743-1760, 2017.
[3] Y. F. Chen, M. Everett, M. Liu, and J. P. How, "Socially aware motion planning with deep reinforcement learning," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[4] X. T. Truong and T. D. Ngo, "Toward socially aware robot navigation in dynamic and crowded environments: A proactive social motion model," in ICRA 2019 Workshop on MoRobAE - Mobile Robot Assistants for the Elderly, Montreal, Canada, 2019, pp. 20-24.
[5] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, "Social GAN: Socially acceptable trajectories with generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[6] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg, "Who are you with and where are you going?" in CVPR 2011, 2011, pp. 1345-1352.
[7] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: Human trajectory prediction in crowded spaces," in IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 961-971.
[8] F. Bartoli, G. Lisanti, L. Ballan, and A. Del Bimbo, "Context-aware trajectory prediction," in 2018 24th International Conference on Pattern Recognition (ICPR), 2018, pp. 1941-1946.
[9] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, "Learning social etiquette: Human trajectory understanding in crowded scenes," in Computer Vision - ECCV 2016. Springer International Publishing, 2016, pp. 549-565.
[10] Y. Li and S. S. Ge, "Human-robot collaboration based on motion intention estimation," IEEE/ASME Transactions on Mechatronics, vol. 19, no. 3, pp. 1007-1014, 2013.
[11] J. S. Park, C. Park, and D. Manocha, "I-Planner: Intention-aware motion planning using learning-based human motion prediction," The International Journal of Robotics Research, vol. 38, no. 1, pp. 23-39, 2019.
[12] R. Kelley, A. Tavakkoli, C. King, M. Nicolescu, M. Nicolescu, and G. Bebis, "Understanding human intentions via hidden Markov models in autonomous mobile robots," in Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction, 2008, pp. 367-374.
[13] T. Bandyopadhyay, K. S. Won, E. Frazzoli, D. Hsu, W. S. Lee, and D. Rus, "Intention-aware motion planning," in Algorithmic Foundations of Robotics X. Springer, 2013, pp. 475-491.
[14] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291-7299.
[15] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172-186, 2019.
[16] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[17] V. Narayanan, B. M. Manoghar, V. S. Dorbala, D. Manocha, and A. Bera, "ClearPath: Highly parallel collision avoidance for multiagent simulation," arXiv preprint arXiv:2003.01062, 2020.