
Applying reinforcement learning for autonomous robot navigation in unknown environments


Document information

Title: Applying Reinforcement Learning for Autonomous Robot Navigation in Unknown Environments
Authors: Tran Van Hoai, Tran Thanh Binh, Bui Quang Duc, Le Nguyen Anh Tu
Advisor: Pham Hoang Anh, Ph.D.
Institution: Vietnam National University Ho Chi Minh City, Ho Chi Minh City University of Technology
Major: Computer Engineering
Document type: Graduation project
Year of publication: 2022
City: Ho Chi Minh City
Number of pages: 68
File size: 4.39 MB



VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY FACULTY OF COMPUTER SCIENCE AND ENGINEERING

GRADUATION THESIS

APPLYING REINFORCEMENT LEARNING FOR AUTONOMOUS ROBOT NAVIGATION IN UNKNOWN ENVIRONMENTS

Major: COMPUTER ENGINEERING

THESIS COMMITTEE: COMPUTER ENGINEERING

SUPERVISOR: TRAN THANH BINH, M.S.

Declaration

We proclaim that this is our research work, which is being supervised by Assoc. Prof. Tran Van Hoai, Ph.D., and Tran Thanh Binh, M.S. The materials and results of the research in this topic are legitimate and have never been published in any format. The contents of the analysis, survey, and evaluation were gathered by the writers from various sources and are cited in the reference section. In addition, figures and models from other groups of writers were used in the study, and the sources were cited and annotated. If any fraud is discovered, we undertake to accept full responsibility for the research's substance. Ho Chi Minh City University of Technology has nothing to do with copyright infringement.

Acknowledgements

Firstly, we want to sincerely thank all of our advisors, Tran Thanh Binh, M.S., and Assoc. Prof. Tran Van Hoai, for their enthusiasm and patience. They have also given us excellent supervision and guidance, which have helped us tremendously at all times during our research.

Additionally, we would like to thank all of the lecturers of the Faculty of Computer Science and Engineering in particular, and all of the faculty members of Ho Chi Minh City University of Technology in general, for their dedication to teaching and for helping the group acquire the knowledge and skills it needed. Their enthusiasm in every lesson has encouraged the group to keep pursuing its passions and achieving its goals.

Abstract

In recent years, robots that are mobile or autonomous have become more and more common in industry. One of the most crucial areas of the field is the study of path-finding techniques that enable the robot to move efficiently from the starting point to the goal while avoiding obstacles. Although there are applications of mathematics and algorithms to solve the task in known static environments, a lot of additional research is required to adapt autonomous robots to unknown static environments and even dynamic environments. The Reinforcement Learning (RL) approach is thus put into the study as one of the potential solutions to such challenges.

In this thesis, we aim to describe the research on path-planning and sampling-based planning algorithms for an autonomous robot with a limited vision range to find the path to a pre-determined goal in static environments while avoiding 2D uncertainties in the form of polygonal obstacles. In addition, we researched the basics of a real autonomous robot (TurtleBot 3) that is capable of performing tasks in an unknown environment by itself without explicit human control. Most importantly, we studied the principles of RL and then implemented an RL-based mechanism that allows the autonomous robot to move in unknown static environments. The challenge of applying RL to different maps or dynamic environments is an option for research but not an urgent one.

Based on the analysis of the simulation results, we demonstrate the feasibility and efficiency of the proposed approach in comparison with others published in [1], opening up opportunities in the future to add more features and create robots that are appropriate for moving in dynamic situations with an unknown set of obstacles.


This thesis is arranged into five chapters including:

Chapter 1: Introduction

Chapter 2: General knowledge

Chapter 3: Methodology and implementation for RL system

Chapter 4: Metrics and simulation results in comparison with other approaches

Chapter 5: Conclusion

Contents

1.1 General 1

1.2 Background 2

1.3 Scope of Work 2

2 Preliminary Studies 3

2.1 Path Planning 3

2.1.1 Global Path Planning 4

2.1.2 Local Path Planning 4

2.2 RRT-based algorithms 4

2.2.1 RRT 5

2.2.2 RRT* 5

2.3 Reinforcement Learning 6

2.3.1 Introduction 6


2.3.2 Terminologies 7

2.3.3 The Structure of an RL System: Markov Decision Process 9

2.3.4 Goals and Rewards 10

2.3.5 Returns and Episodes 11

2.3.6 Designing Reward Function 13

2.3.7 Policies 16

2.3.8 Value Functions 16

2.3.9 Optimal Policies and Optimal Value Functions 17

2.3.10 Q-Learning 18

2.3.11 Bellman Equation 19

2.3.12 Exploration or Exploitation with ϵ-Greedy 20

3 Proposed RL System 22

3.1 Notations 22

3.2 Problem Specification 24

3.3 Framing Problem into RL system 26

3.4 General Operations of the Agent 30

3.5 Training Configuration 33

4 Experimental Results 34

4.1 Selecting Metrics 34

4.2 Simulation and Comparison in Python Environment 34

4.2.1 Testing environments 34

4.2.2 Testing Convergence 39

4.2.3 Comparing with RRTX 42

5 Conclusion 52

5.1 Accomplishment 52

5.2 Future Works 52


List of Figures

2.1 An illustration of global path planning and local path planning 3

2.2 RRT* has denser space exploration and path refinement than RRT [11] 6

2.3 An established framework for reinforcement learning 10

2.4 Time and the discount factor’s impact on the value of rewards 12

2.5 Visualization of the Inverted Pendulum Swingup problem 14

2.6 Visualization of the Lunar Lander problem 15

2.7 Bellman Equation for Updating Q value 19

3.1 The map sized 70 × 70 with obstacles and free space 23

3.2 The map sized 70 × 70 with one RRT* tree and one RRT* path connecting starting node and the goal 23

3.3 Components in the Python simulation environment 24

3.4 The autonomous robot finds a path to reach the goal on a map sized 40 × 40 when moving on RRT* tree after the training process 25

3.5 Possible actions of the agent with a vision range of 3 units at a given state 27

3.6 The agent can take actions marked with green dots within its vision range in order to move to another tree node 27

3.7 The agent primarily prefers to choose the state that is closer to the goal and obstacle 29

3.8 Operation flow chart of the agent 31

3.9 The agent follows the RRT* paths colored in blue 32

3.10 Upon obstacle appearance, the agent observes to choose another green-colored tree node within its vision range 32

4.1 Map with dead ends 1 35

4.2 Map with dead ends 2 35

4.3 Map with dead ends 3 36

4.4 Map with convex polygon obstacles 1 37

4.5 Map with convex polygon obstacles 2 37

4.6 Map with convex polygon obstacles 3 38

4.7 Extended obstacle 38

4.8 Map with 4 obstacles 39

4.9 Total rewards each episode converged during the training process 40

4.10 The path length of the agent every 100 episodes 41

4.11 Map with convex polygon obstacles 42

4.12 Map with 3 obstacles example 1 43

4.13 Map with 3 obstacles example 2 43

4.14 Map with 4 obstacles example 1 44

4.15 Map with 4 obstacles example 2 45

4.16 Map with 5 obstacles example 1 46

4.17 Map with 5 obstacles example 2 46

4.18 Known map with 4 obstacles 47

4.19 Unknown maps from known map 48

4.20 Map with shifting bottom-left obstacle 48

4.21 Map with adding an obstacle 49

4.22 Map with adding many obstacles 49

List of Abbreviations

ROS Robot Operating System

RL Reinforcement Learning

MDP Markov Decision Process

API Application Programming Interface

RRT Rapidly Exploring Random Tree

PPCR Path planning of coverage region

ASP Approximate shortest path


1 Introduction

1.1 General

People in recent years have become more oriented toward their careers, making it more difficult for them to maintain both their homes and offices at the same time due to their inconsistent work schedules. Most of the time, they employ cleaners to keep home spaces, offices, etc. clean. However, they do not trust them.

To overcome the problem, vacuum robots have been developed using more advanced technologies and are designed to automate the cleaning process. While these robotic cleaners have made home cleaning easier, they have to scan every possible area in the whole room environment to plan a path of coverage region (PPCR) [3].

Another real-world example is autonomous mobile robots (AMRs), which are really helpful in factories. These robots can be transporters of goods in the factory. They come around to allocate the items for departments and then return to their "home", also known as the warehouse, to continue receiving goods or charge the battery. Like vacuum robots, one of the main disadvantages of AMRs is that they travel along a pre-determined map.

Several path-planning approaches have been proposed [4], [5], and [6]. However, these techniques are not highly efficient in a real environment simply because they neglect unknown dynamic environments. In other words, the general problem of these sorts of autonomous robots is that the common path-finding methods do not guarantee to run in dynamic environments.

Therefore, it is worthwhile for us to study and apply Reinforcement Learning to this navigation problem.


1.2 Background

Our thesis is based on the previous study "Developing a method finding geometric shortest paths with machine learning support to estimate the distance to destination" [2]. The study achieved objectives such as:

• Implementing Local First strategy to the current algorithm

• Implementing flexible vision range mechanism

• Optimizing the robot's path on the same map in the Python environment with the help of RL-like ideas

In summary, their thesis applied the algorithm for finding the shortest path, the Approximate Shortest Path, on the real robot. Moreover, they also applied RL-based ideas (not a real RL mechanism in detail) to let the robot learn the map in the simulation environment.

1.3 Scope of Work

Our main goal is to concentrate on applying the RL technique to the autonomous robot with the RRT* algorithm to find efficient paths to the goal in an unknown static environment. In addition, we limit the criteria to the following requirements:

• The obstacles are convex polygons

• The autonomous robot has to be trained on a 2-dimensional static map or environment, given starting points $(s_x, s_y)$ and a goal $(g_x, g_y)$. In other words, one provided destination will be paired with different starting points.

• After the training process on a particular map, the autonomous robot is capable of reaching the target with effective paths that walk along obstacles rather than colliding with or coming across them.

• With additional or shifted obstacles in the map on which the autonomous robot has been trained, it is still able to reach the goal.

• The robot must follow the RRT* path and use RL to choose another node on the tree whenever the robot meets obstacles, and then repeat the process until reaching the goal.


2 Preliminary Studies

2.1 Path Planning

The autonomous robot system is implemented by using a path-planning method. There are two categories of path-planning methods: local and global. The global path-planning method computes a path offline based on the whole map information, while the local path-planning method calculates the route based on the local visibility of its current location and the places it has passed.

Figure 2.1: An illustration of global path planning and local path planning


2.1.1 Global Path Planning

Global path planning approaches need previous knowledge about the robot’senvironment Therefore, the global map is also referred to as a static map Two

of the most often used algorithms for global path planning are heuristic searchingmethodology and a collection of Intelligent algorithms

2.1.2 Local Path Planning

Local path planning is based on nearby surroundings information and robot state assessment, with the goal of dynamically designing a simulated path without colliding with obstacles. Path planning in a dynamic environment is a more complex challenge due to unpredictable factors such as the movements of obstacles in the dynamic environment. In this situation, path planning algorithms must be flexible to the dynamic properties of the environment by gathering information about unknown areas of the environment.

2.2 RRT-based Algorithms

Basically, the idea of RRT-based algorithms is that RRTs take samples from the space and add them to a tree that expands to encompass the entire planning space. In particular, RRTs [7] have two important features: 1) they traverse the entire space effectively and fast; and 2) they are probabilistically complete, i.e., the likelihood of finding a solution tends towards one as the number of nodes in the tree rises. However, RRTs are not optimal in convergence rate, and the RRT method does not involve rewiring, thus connections between nodes are only set once. The advent of the RRT* method [8] addressed this by enabling the tree connections to be rewired and thus reducing the path length from the root to the leaf. In other words, RRT* [9] is a category of RRT-based planners that try to develop real-time solutions for high-dimensional systems by gradually exploring lower-dimensional subspaces.

RRT and RRT* work in configuration spaces, which are collections of all potential robot transformations [10], [11]. Let the set $Z \subset \mathbb{R}^n$, $n \in \mathbb{N}$, represent the provided state space, where $n$ denotes the size of the supplied search space. The region of the search space filled by obstacles is denoted by $Z_{obs} \subset Z$, and the region free from obstacles is represented by $Z_{free} = Z \setminus Z_{obs}$. $z_{goal} \subset Z_{free}$ represents the goal and $z_{init} \subset Z_{free}$ represents the starting point; $z_{goal}$ and $z_{init}$ are sent to the planner as inputs. Then we identify a path that is free of collisions between the initial point and the goal. To do so, the set of obstacles must be identified in order to eliminate the space that these obstacles occupy. In other words, these sampling-based planning algorithms are highly appreciated for known environments.

2.2.1 RRT

RRT is a method that creates random space-filling trees to efficiently explore non-convex high-dimensional regions. Trees are constructed from randomly selected samples from the search space and are fundamentally geared to grow toward big unexplored sections of the problem space.

RRT grows a tree rooted at the beginning configuration by using random samples from the search space. As each sample is drawn, an attempt is made to connect it to the nearest state in the tree. A new state is added to the tree if the link is viable.

A growth factor frequently limits the length of the connection between the tree and the new state. If a random sample is more than this growth factor away from the nearest state in the tree, a new state at the maximum allowed distance along the line from the tree toward the random sample is used instead of the random sample itself. The direction of tree growth can be controlled by random sampling, while the rate is determined by the growth factor.
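To make the growth step concrete, the following is a minimal Python sketch of the RRT loop described above (not the thesis implementation): it assumes a 2D point robot, Euclidean distances, and a caller-supplied collision_free(a, b) predicate that checks whether the edge between two nodes is obstacle-free.

```python
import math
import random

class Node:
    def __init__(self, x, y, parent=None):
        self.x, self.y, self.parent = x, y, parent

def nearest(tree, x, y):
    # Nearest existing tree node to the random sample (linear scan for clarity).
    return min(tree, key=lambda n: (n.x - x) ** 2 + (n.y - y) ** 2)

def steer(near, x, y, growth=1.0):
    # Limit the new edge length to the growth factor described above.
    d = math.hypot(x - near.x, y - near.y)
    if d <= growth:
        return Node(x, y, near)
    t = growth / d
    return Node(near.x + t * (x - near.x), near.y + t * (y - near.y), near)

def rrt(start, goal, collision_free, width, height, iterations=2000, growth=1.0):
    tree = [Node(*start)]
    for _ in range(iterations):
        rx, ry = random.uniform(0, width), random.uniform(0, height)
        near = nearest(tree, rx, ry)
        new = steer(near, rx, ry, growth)
        if collision_free(near, new):          # hypothetical obstacle check
            tree.append(new)
            if math.hypot(new.x - goal[0], new.y - goal[1]) <= growth:
                tree.append(Node(goal[0], goal[1], new))
                break                          # goal connected to the tree
    return tree
```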

2.2.2 RRT*

One of the variants of RRT is RRT*, which converges toward an optimal solution. For RRT*, the selection of the node is nearly the same as in the RRT algorithm, and the difference comes from the way RRT* connects the newest created nodes. As in RRT, the newest node will link to the nearest node in the tree. However, in RRT*, it is not necessary for the newest node to connect to the nearest one. Instead, it will look for other nodes within a certain radius and see if it can connect to them locally in a way that preserves the tree structure while simultaneously optimizing the total path length [13]. Figure 2.2 depicts space exploration and path quality improvement. Because of its asymptotic optimality, RRT* gradually decreases its path cost as the number of iterations grows. On the other hand, RRT does not improve an inefficient path.
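Below is a hedged sketch of the two steps that distinguish RRT* from RRT: choosing a lower-cost parent within a neighbourhood radius, and rewiring neighbours through the new node when that shortens their path. It reuses the Node class, the math import, and the hypothetical collision_free predicate from the previous sketch, and recomputes costs naively for clarity rather than caching them as a real planner would.

```python
def cost(node):
    # Path cost from the root to this node, accumulated over parents.
    c = 0.0
    while node.parent is not None:
        c += math.hypot(node.x - node.parent.x, node.y - node.parent.y)
        node = node.parent
    return c

def extend_rrt_star(tree, new, collision_free, radius=2.0):
    # 1) Choose the parent that minimizes cost-to-come among nearby nodes.
    near_nodes = [n for n in tree
                  if math.hypot(n.x - new.x, n.y - new.y) <= radius]
    best = min(near_nodes, default=new.parent,
               key=lambda n: cost(n) + math.hypot(n.x - new.x, n.y - new.y))
    if best is not None and collision_free(best, new):
        new.parent = best
    tree.append(new)
    # 2) Rewire: reconnect neighbours through `new` when that shortens their path.
    for n in near_nodes:
        via_new = cost(new) + math.hypot(n.x - new.x, n.y - new.y)
        if via_new < cost(n) and collision_free(new, n):
            n.parent = new
```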


Figure 2.2: RRT* has denser space exploration and path refinement than RRT [11]

2.3 Reinforcement Learning

We discuss the fundamental knowledge and ideas of RL in this section. Furthermore, we introduce how the RL problem can be defined as an agent that makes decisions in an environment in order to maximize the cumulative reward.

In general, RL is a branch of machine learning that provides automated techniques for identifying patterns in data and then using them to do certain tasks. RL involves learning how to optimize a numerical reward signal by mapping situations to actions. Learners are not instructed which behaviors to perform; rather, they must attempt to determine which acts yield the largest reward. As a result, they just need to be capable of communicating with the environment and acquiring information; they do not need to have an in-depth understanding of or control over the environment. Furthermore, in the most intriguing and complex instances, actions might have an impact not just on the current reward but also on the next scenario and hence all following rewards. These two essential defining characteristics of RL are trial-and-error and delayed reward.

RL is distinguished from supervised learning, which is the sort of learning investigated in the majority of present machine learning research. In supervised learning, a competent external supervisor provides a training set of labeled examples. Each example contains a description of the circumstance as well as a specification (label) of the appropriate action the system should perform in that case. Labels are frequently used to designate the category to which the condition belongs. The idea of this approach to learning is for the system to extrapolate or generalize the response so that it responds appropriately in scenarios that were not included in the training set.

RL is also distinct from unsupervised learning, which is often focused on discovering structures contained in sets of unlabeled data. Although one would be tempted to think of RL as a type of unsupervised learning since it does not rely on examples of proper behavior, RL aims to maximize the reward signal rather than to find hidden structures.

As a result, RL is classified as a third machine learning paradigm, alongside supervised learning and unsupervised learning.

One more difference between RL and other learning methods is the trade-off between exploration and exploitation. To maximize rewards, the RL agent should prioritize actions that have already been attempted and proved to be beneficial in producing rewards. To discover such actions, however, actions that were not previously picked must be undertaken. In other words, the agent not only needs to exploit its learned knowledge about the environment to get the reward, but it also must explore the map for better future action choices.


2.3.2 Terminologies

Reward

An RL problem's objective is described by a reward. The reinforcement learning agent receives a single value, known as the reward, from the environment after each action. The sole purpose of the agent is to increase the overall benefit obtained over the long run. Therefore, the reward signal defines which events are advantageous to the agent and which are not.

State

It signifies the current condition yielded by the environment

Policy

A policy describes how the learning agent should react to the environment at a specific time. A policy, broadly speaking, is the action the agent performs in a given state of the environment, and this action selection is effected through the agent's perception of the environment. A policy may be a function or a lookup table, or it may need complex computations like search procedures.

Episode

An episode is the whole set of actions and states that the agent went through from the beginning position to the goal.

Value

It is the desirable long-term return, with discounting, as opposed to the short-term reward.

Value Function

What is good in the long run is defined by the value function. In general, the value of a state is the total amount of reward that an agent can expect to accumulate in the future, beginning from that state.

Model

This is something that mimics the behavior of the environment, or more generally, that allows inferences to be made about how the environment will behave. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced. Methods of using models and plans to solve RL problems are called model-based methods. This is in contrast to simpler model-free methods. The model-free method is clearly trial-and-error learning and is considered almost the opposite of planning.

2.3.3 The Structure of an RL System: Markov Decision Process

An MDP is formalized as a tuple of five components $(S, A, T, R, \gamma)$, where

• $S$ illustrates the set of possible states (the state space).

• $A$ represents the set of possible actions (the action space).

• $T : S \times A \times S \to [0,1]$ indicates the probability that a state $s$ may change to state $s'$ under action $a$.

• $R : S \times A \times S \to \mathbb{R}$ shows the immediate reward that the agent receives after moving from state $s$ to $s'$ because of action $a$.

• $\gamma \in [0,1]$ is the discount factor that defines the weight of potential future rewards.

More specifically, the agent participates in discrete time step interactions with its environment, $t = 0, 1, 2, \dots$. At each time step, the agent is given some representation of the environment's state $S_t \in S$ and selects an action $A_t \in A$ based on this state. As a result, we have the state-action pair $(S_t, A_t)$. Then, time advances to the next time step $t+1$, the agent is moved to a new state $S_{t+1} \in S$, and it receives a reward $R_{t+1} \in \mathbb{R}$ for taking the action $A_t$ from state $S_t$. Figure 2.3 illustrates the typical agent-environment interaction framework.


When it comes to mapping state-action pairs to rewards, we can think of the procedure for receiving a reward as an arbitrary function $f$. We have $f(S_t, A_t) = R_{t+1}$ at each time $t$.
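The agent-environment cycle above can be summarized in a short, generic loop. The env.reset, env.step, and agent methods below are hypothetical interfaces used only for illustration, not the thesis's simulation API.

```python
def run_episode(env, agent, max_steps=200):
    """Generic agent-environment loop mirroring the S_t, A_t, R_{t+1} cycle."""
    state = env.reset()                      # S_0
    total_return = 0.0
    for t in range(max_steps):
        action = agent.act(state)            # A_t chosen from S_t
        next_state, reward, done = env.step(action)   # yields R_{t+1} and S_{t+1}
        agent.observe(state, action, reward, next_state)
        total_return += reward
        state = next_state
        if done:                             # terminal state reached
            break
    return total_return
```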

Figure 2.3: An established framework for reinforcement learning

2.3.4 Goals and Rewards

One of the most interesting characteristics of RL is the use of a reward signal to describe the concept of a goal. The reward is a specific signal supplied from the environment to the agent that defines the purpose of the agent. A reward is a single numerical value given at each step, and the agent seeks to maximize its overall returned rewards. This includes optimizing cumulative reward over time rather than immediate reward.

At first, thinking about goals in terms of reward signals may seem restrictive. However, in practice, it has proved to be adaptable and broadly useful. For example, consider building a robot that learns where to locate and how to gather empty soda cans for recycling. In most cases, researchers might award it a reward of 0 and then a reward incremented by 1 for each can gathered. When the robot runs into obstacles, researchers might also wish to punish it with negative rewards. Another example is teaching an agent how to play chess: the rewards are +1 for a victory, -1 for a defeat, and 0 for both draws and non-terminal positions.

All of these examples make it clear what is happening. The agent is continually learning how to improve its reward. If we want it to do something for us, we must provide rewards that enable it to maximize them while also accomplishing our goals. Therefore, it is crucial that the rewards we set up accurately reflect what we want to achieve. For instance, if a chess-playing agent's main objective is just to win the game, it should only receive rewards when it really wins, rather than when it captures its opponent's pieces or maintains control of the board's center. The agent may discover a method to attain these kinds of subgoals without reaching the main goal, which is to win the match, if doing so were rewarded.

2.3.5 Returns and Episodes

The long-term goal of the MDP agent is to maximize its cumulative reward. We now introduce a way to formalize the cumulative reward and the idea of the expected return of the rewards at a specific time step.

The return is just the total of upcoming rewards. The return $G_t$ at time $t$ is mathematically defined as

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T \tag{2.1}$$

where $T$ is the last time step.

This strategy is reasonable in applications where the idea of the last time step arises naturally, i.e., when the interaction between the agent and environment naturally divides into episodes (or trials), such as trips through a maze, plays of a game, or any kind of repeated interaction. Every episode finishes in a unique condition known as the terminal state; then either a sample from a typical distribution of starting states or a typical starting state is used to reset the system. For instance, an episode can finish in several ways, such as winning or losing a game. Regardless of how the previous episode ends, the next one begins anew. The episodes could be thought of as having many terminal states with various rewards for the various results. These kinds of tasks are referred to as episodic tasks. In episodic tasks, the time of termination, $T$, is an unpredictable factor that frequently varies from episode to episode.

The agent-environment interaction frequently does not simply divide into such distinct episodes but instead continues without end. This would be a natural setting when developing a program for a robot with a long lifespan or an ongoing process-control task, for example. We refer to these as continuing tasks. As a result, the return formulation (2.1) is problematic for continuing tasks because the final time step would be $T = \infty$, and the return we want to maximize is potentially limitless.

The possibility of lengthy or unlimited sequences of time steps means that the sum of rewards keeps growing uncontrollably. Unboundedness is another term for this issue.


To dramatically lower the value of future rewards, we use a positive real value called the discount factor $\gamma$ $(0 \le \gamma \le 1)$. Over time, the discount factor influences how valuable rewards are. Moreover, reducing the variance of return estimates is another significant consideration in the use of the discount factor, which helps to stabilize learning for agents. In other words, given that the future is unknown, stochasticity increases as we go deeper into the future and our value estimates have greater variance. Figure 2.4 illustrates the impact of the time step and discount factor on the value of rewards, in the case of a +1 reward given, for example.

Figure 2.4: Time and the discount factor’s impact on the value of rewards

According to this approach, we have the discounted return

G t = R t+1 + γR t+2 + γ2R t+3 + = X∞

k=0

The following is an example of the value of γ:

bounded, has a finite value

• γ = 0: the agent is "myopic" in that it is solely concerned with maximizing immediate rewards Its goal is to learn how to pick A t such that it maximizes

just R t+1

• γ = 1: when the return target emphasizes future rewards, the agent becomes

more foresighted


From the formulation (2.2), we can demonstrate in (2.3) how returns at successive time steps are related to one another:

$$\begin{aligned} G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \\ &= R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \right) \\ &= R_{t+1} + \gamma G_{t+1} \end{aligned} \tag{2.3}$$
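As a small illustration of equations (2.2) and (2.3), the discounted return of a finite reward sequence can be computed by folding the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ backwards over the rewards; the sketch below assumes the episode's rewards have already been collected in a list.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_0 for a finite reward sequence R_1, R_2, ..., R_T."""
    g = 0.0
    # Work backwards so that G_t = R_{t+1} + gamma * G_{t+1} (equation 2.3).
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three +1 rewards with gamma = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```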

2.3.6 Designing Reward Function

The most crucial factor to take into account when setting up a reward function is deciding on the agent's terminal conditions. In detail, terminal conditions are the conditions under which the environment will reset back to the initial conditions and a new episode will begin. Moreover, terminal conditions also avoid wasting time in case the agent gets stuck. For instance, if a robotic arm wants to carry an object to a place, it may not be useful to continue when the robotic arm has dropped the object. You might want it to learn to recover, but if you are just worried about learning to place the object properly, then whenever it takes an action that causes the object to fall, we could reset back to the initial conditions. In other words, we need to reset the environment to focus on the most meaningful states. Basically, we can break terminal conditions down into three groups:

• Time limits help to reduce the amount of training time allocated to each episode. When an agent is trying an approach that is never going to work, it is typically helpful to end the episode and send it back to the beginning so that it can try again.

• Positive terminals are the signals where the agent has succeeded, for example, when the robotic arm successfully picks up the object. Under these conditions, the agent's task in an episode is considered to be finished, and the episode needs to be restarted so the agent can practice from the beginning.

• Negative terminals are the signals where the agent has failed, for example, when the robotic arm drops the object while carrying it. Under negative terminals, we also need to restart the system and give the agent another chance to try.

The structure of a reward function can be as simple as a single formula or as complex as many combined conditions.

Simple Reward Function

Let us take the Inverted Pendulum Swingup problem as an example [15]. The agent must swing the inverted pendulum up in order to keep it in its upright position with the smallest amount of effort.

Figure 2.5: Visualization of the Inverted Pendulum Swingup problem

Three variables $(\cos\theta, \sin\theta, \omega)$ make up the vector of the state space $S$, representing the cosine of the angle of the rod, its sine, and the angular velocity, respectively. The action space $A$ is a continuous variable in $[-2, 2]$: the action is the torque that rotates the pendulum in either a clockwise or counterclockwise direction. The idea is to effortlessly maintain a perfect balance while standing straight. The reward function is a straightforward equation without many restrictions, depending on the angle, speed, and effort:

$$r = -\left(\theta^2 + 0.1\,\omega^2 + 0.001\,a^2\right)$$

where $\theta$ represents the angle of the pendulum normalized to $[-\pi, \pi]$, with 0 being the upright position. According to the equation, when the pendulum is upright with zero velocity and no action applied, the maximum reward of zero is received.
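A possible Python rendering of this reward, assuming the angle, angular velocity, and applied torque are available as plain numbers; this mirrors the commonly used Gym formulation rather than any particular implementation in the thesis.

```python
import math

def pendulum_reward(theta, omega, action):
    """Negative cost: highest (zero) when upright, motionless, and effort-free."""
    theta = ((theta + math.pi) % (2 * math.pi)) - math.pi   # normalize to [-pi, pi)
    return -(theta ** 2 + 0.1 * omega ** 2 + 0.001 * action ** 2)

# Upright, motionless, no torque gives the maximum reward of zero.
assert pendulum_reward(0.0, 0.0, 0.0) == 0.0
```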

Complex Reward Function

Let us take the Lunar Lander environment as an example [16]. This kind of environment is a well-known rocket trajectory optimization problem. The agent tries to land on the given pad successfully using as little fuel as possible.


Figure 2.6: Visualization of the Lunar Lander problem

The state space $S$ is an 8-dimensional vector: the lander's x and y coordinates, its linear and angular velocities, its angle, and two booleans that indicate whether or not each leg is in contact with the ground. Additionally, there are four distinct actions that can be taken: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine. The following are some exemplary illustrations of reward and punishment ideas for the Lunar Lander reward function:

• If the agent lands in the proper place at a low enough velocity, it receives higher rewards.

• If the agent touches down away from the landing pad, it receives a penalty.

• Based on the amount of fuel left, give the agent a reward.

• If the velocity is higher than a threshold when landing on the surface, a heavy penalty will be given to the agent.

• Reward the agent for the distance traveled toward the target.

Thus, the reward function can be complex, combining many of these conditions, as sketched below.
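This sketch is purely illustrative: it combines the reward and penalty ideas listed above into a single function, and all names, thresholds, and weights (pad_x, safe_speed, the ±50/±100 bonuses and penalties, the fuel bonus) are assumptions rather than the actual Lunar Lander implementation.

```python
def lander_reward(x, vx, vy, fuel_left, on_pad, landed, crashed,
                  pad_x=0.0, safe_speed=1.0):
    """Illustrative multi-condition reward combining the ideas listed above."""
    reward = 0.0
    reward -= abs(x - pad_x)                 # shaping: closer to the pad is better
    if landed:
        speed = (vx ** 2 + vy ** 2) ** 0.5
        if on_pad and speed <= safe_speed:
            reward += 100.0 + fuel_left      # successful landing, bonus for fuel saved
        elif not on_pad:
            reward -= 50.0                   # touched down away from the pad
        if speed > safe_speed:
            reward -= 100.0                  # heavy penalty for landing too fast
    if crashed:
        reward -= 100.0                      # failure terminal
    return reward
```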


2.3.7 Policies

Depending on the design criteria, a policy can return a single action for a given state or a probability distribution over actions; both forms are sketched in code below.

• Deterministic: the policy is described by $\pi(s) : S \to A$.

• Stochastic: the policy is described by $\pi(a \mid s) : S \times A \to [0,1]$, where $\pi(a \mid s)$ indicates the probability of choosing action $a$ in state $s$. Note that $\pi$ is a probability distribution over $a \in A(s)$ for every state $s \in S$.
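A minimal sketch of the two policy forms, using a made-up set of states and actions: the deterministic policy is a plain lookup table, while the stochastic policy stores a probability for each action and is sampled from.

```python
import random

actions = ["up", "down", "left", "right"]

# Deterministic policy pi(s): S -> A, here just a lookup table.
deterministic_pi = {"s0": "up", "s1": "right"}

# Stochastic policy pi(a|s): S x A -> [0, 1]; probabilities sum to 1 per state.
stochastic_pi = {"s0": {"up": 0.7, "down": 0.1, "left": 0.1, "right": 0.1}}

def sample_action(state):
    """Draw an action according to the stochastic policy for `state`."""
    probs = stochastic_pi[state]
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(deterministic_pi["s0"], sample_action("s0"))
```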

2.3.8 Value Functions

Almost all RL algorithms require estimating value functions: functions of states (or of state-action pairs) that calculate how good it is for the agent to be in a certain state (or how effective it is to carry out a certain action in a given state). In this sense, "how good" is defined in terms of potential rewards in the future, or more particularly, in terms of expected return. Value functions are therefore specified in relation to policies. Formally, we say that an agent "follows a policy".

Value functions are functions of states or of state-action pairs, as previously described. For every policy $\pi$ being followed, we have two value functions that are interrelated: the state-value function and the action-value function (Q-function).

State-Value Function

The state-value function indicates how good a certain state is for the agent using policy $\pi$. To put it another way, it gives the value of a state under $\pi$.

Formally, the value of state $s$ under policy $\pi$ is the expected return when starting in $s$ at time $t$ and following $\pi$ thereafter:

$$v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]$$

Action-Value Function

The action-value function indicates how good it is to take a certain action in a certain state using policy $\pi$. Or, to put it another way, it offers the value of an action under $\pi$.

Formally, the value of action $a$ in state $s$ under policy $\pi$ is the expected return when beginning in state $s$ at time $t$, performing action $a$, and following policy $\pi$ thereafter:

$$q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]$$

2.3.9 Optimal Policies and Optimal Value Functions

In order to handle an RL challenge, generally speaking, we must identify a policy that, over time, yields a large amount of reward. The following is how we could define an optimal policy for finite MDPs. A policy $\pi$ is considered better than or equal to a policy $\pi'$ if the expected return using $\pi$ is larger than or equal to the expected return using $\pi'$ for all states. In other words, we can mathematically describe it as

$$\pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s) \text{ for all } s \in S \tag{2.7}$$


Optimal State-Value Function

An optimal state-value function is related to the optimal policy. The optimal state-value function, denoted $v_*$, can be defined as follows:

$$v_*(s) = \max_\pi v_\pi(s) \quad \text{for all } s \in S$$

In other words, for each state $s \in S$, $v_*$ delivers the highest expected return achievable under any policy $\pi$.

Optimal Action-Value Function

The optimal policy also has an optimal action-value function, or optimal Q-function, which is denoted by $q_*$ and is defined as

$$q_*(s, a) = \max_\pi q_\pi(s, a) \quad \text{for all } s \in S, a \in A(s)$$

In other words, for each state-action pair, $q_*$ delivers the highest expected return achievable under any policy $\pi$.

2.3.10 Q-Learning

Q-learning is a model-free RL algorithm for determining the worth of an action in a given state. It has no demand for an environment model and can handle problems with unpredictable transitions and rewards without the need for adaptations. There are two policies in Q-learning [20]:

• Behavior policy: used to generate experiences and interact with the surroundings.

• Target policy: used to choose the best action after the learning process; it is also the policy we are studying.

Q-learning identifies the best policy for any finite MDP (FMDP) in terms of optimizing the expected value of the total reward over all successive steps, beginning from the current state. Given an unlimited exploration time and a partially stochastic policy, Q-learning can determine an effective action-value policy for any particular FMDP. "Q" denotes the function by which the algorithm computes the quality of an action performed in a particular state.


Values in Q-learning are kept in the Q-table and are referred to as Q-values. Each Q-value corresponds to a state-action pair and represents the quality of an action taken from that state.

Process of Q-Learning:

1. Initialize Q-values: at this point, each Q-value is set to an arbitrary fixed value.

2. Choose action $a$ for state $s$ (best Q-value): initially, there is no best Q-value, so actions are chosen randomly so that the agent can explore the environment. The more training steps the agent takes, the more random exploration is reduced and exploitation is used instead.

3. Perform the action and move to the next state.

2.3.11 Bellman Equation

In RL, the Bellman Equation can be applied to update the Q-value in Q-learning. Here is the formula to update the Q-value:

$$Q^{new}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

Figure 2.7: Bellman Equation for Updating Q value

where $r_t$ is the reward received when moving from state $s_t$ to $s_{t+1}$, $Q^{new}(s_t, a_t)$ is the newly updated Q-value, and $\max_a Q(s_{t+1}, a)$ is the highest Q-value over the actions available from state $s_{t+1}$.

$\alpha$: the learning rate $(0 < \alpha < 1)$, or step size, influences how much new information replaces old information. A factor of 0 means that the agent learns nothing, whereas a factor of 1 means that the agent only considers the most recent information. When the problem is stochastic, the algorithm converges only under certain technical conditions on the learning rate that require it to decline to zero.

$\gamma$: the discount factor determines the importance of future rewards. A factor of 0 causes the agent to focus only on immediate rewards, whereas a factor close to 1 causes it to seek a long-term high return. If the discount factor is equal to or greater than one, the action values may diverge.
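A tabular sketch of this update rule, assuming discrete, hashable states and actions. The Q-table is a dictionary keyed by state-action pairs that defaults to zero, which only loosely corresponds to the initialization step described earlier.

```python
from collections import defaultdict

Q = defaultdict(float)          # Q-table: (state, action) -> Q-value, defaults to 0.0

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s,a) += alpha * (TD target - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```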

2.3.12 Exploration or Exploitation with $\epsilon$-Greedy

During the training phase, the agent in RL determines which actions to take. To understand how action selection is carried out, we must first understand the concepts of exploration and exploitation. At first, the agent knows extremely little or nothing about its surroundings. To gain further experience of the environment, the agent can choose to discover by performing an action with an unknown outcome. Later, it may decide to take an action based on its prior knowledge of the surroundings in order to reap good results. In other words, by exploring, the agent can improve its existing knowledge and receive higher long-term benefits.

The difficulty of balancing exploration and exploitation is unique to RL. Because exploration and exploitation cannot be carried out simultaneously, we must choose which one to use to carry out an operation at a certain time [17].

The $\epsilon$-greedy approach is a straightforward way of controlling exploration and exploitation [14]. Instead of blindly choosing one of the learned effective actions with regard to the Q-function, with $\epsilon$-greedy the agent can perform a random action with a fixed probability $0 \le \epsilon \le 1$ at each state [18]:

$$\pi(s) = \begin{cases} \text{a random action from } A(s) & \text{if } \xi < \epsilon \\ \arg\max_{a \in A(s)} Q(s, a) & \text{otherwise} \end{cases} \tag{2.10}$$

where $0 \le \xi \le 1$ is a uniform random number and $\epsilon$ is reduced after each episode. The implementation of exploration or exploitation with $\epsilon$-greedy can be demonstrated in Algorithm 1. As we have discussed above, a random action is usually selected early in training, while exploitation gradually takes over as $\epsilon$ decreases.
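A direct sketch of formula (2.10) with a simple per-episode decay of $\epsilon$; it assumes the Q-table layout from the previous sketch and a finite list of actions.

```python
import random

def epsilon_greedy(state, actions, Q, epsilon):
    """Pick a random action with probability epsilon, else the greedy one (2.10)."""
    if random.random() < epsilon:            # xi < epsilon: explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])   # exploit learned Q-values

def decay(epsilon, rate=0.995, minimum=0.05):
    """Reduce epsilon after each episode so exploitation gradually dominates."""
    return max(minimum, epsilon * rate)
```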
