
Keywords: Data mining, Game theory, policy making process, reinforcement learning

Algorithmic Trading: Game-theoretic and Simulation Approach to Reinforcement Learning bot


The introduction of algorithms to trading fundamentally changed the stock market. Algorithms made it easy to react quickly to certain events on the stock market. Machine learning algorithms also made it much easier for analysts to create models for predicting stock prices. The introduction of machine learning means that new models can be developed based on past data. As evidence, AI funds have outperformed their peers while providing downside protection, according to Eurekahedge's report.


The table above compares AI funds to the average hedge fund and to systematic CTA/managed futures strategies, which can be considered a rough approximation of the average quant fund. Source: Eurekahedge.

Motivated by the successful performance of AI funds, in this paper we introduce a method for creating an artificial agent that trades on the stock market using stock prices and several machine learning algorithms.

1.2 Objective of research

The monetary motivation behind the predictive value of buying and selling stocks at profitable positions is a key driver of this research. Our main hypothesis is that by applying machine learning and training it on past data, it is possible to predict the movement of the stock price through the market's patterns, and then to apply algorithms to create a profitable trading agent. We use the Profit and Loss (PnL) of the agent throughout the test to judge its profitability. We shall conduct several simulations to examine whether the agent is profitable on different data sets (seen and unseen) and then calculate the average PnL of the agent, as sketched below.
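As a rough illustration only (not part of the thesis), the average-PnL evaluation could be computed along the following lines; the `agent.act` interface, the simplified position handling and the price arrays are assumptions made for the sketch.

```python
# Minimal sketch (assumed interface): average PnL of a trading agent over
# several simulated price series, on seen and unseen data alike.
import numpy as np

def simulate_pnl(agent, prices, order_size=100):
    """Run one pass over a price series and return the final profit and loss."""
    cash, shares = 0.0, 0
    for t in range(len(prices) - 1):
        action = agent.act(prices[: t + 1])          # hypothetical agent API
        if action == "long" and shares == 0:
            shares, cash = order_size, cash - order_size * prices[t]
        elif action == "short" and shares > 0:
            cash, shares = cash + shares * prices[t], 0
    return cash + shares * prices[-1]                # mark any open position to market

def average_pnl(agent, price_series_list):
    return float(np.mean([simulate_pnl(agent, p) for p in price_series_list]))
```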

1.3 Scope of the research

This thesis only provides an elementary introduction to algorithmic trading, with a game-theoretic approach as the framework for the market environment. The game environment is kept simple: we assume that the responses of the other players to our agent's strategies are reflected in the stock price movement. Moreover, the algorithms used to create and train the agent rely on the machine learning libraries "Scikit-learn" and "Keras". The exploited algorithms and functions are explained in the Appendix of this thesis.


1.4 Overview

The thesis is organized in the following manner:

Chapter 1 states the motivation for writing this thesis, the objectives and the scope of the research.

In Chapter 2, we provide the background of the Efficient Market Hypothesis (EMH) and its contradictions, as well as work relevant to this topic.

The game-theoretic framework used to describe the market, the simulation approach and the algorithms are established in Chapter 3.

Chapter 4 describes the methods of data collection as well as data processing, implementation and simulation on different variants of the model.

Chapter 5 is the last chapter, in which we discuss the final results of our agent, explain the limitations of our research and state future improvements.


Chapter 2: Literature review

This chapter begins with a background on efficient markets and then gives a brief review of previous empirical studies that use machine learning algorithms to construct trading strategies.

1.5 Efficient Markets

One of the strongest objections to the existence of profitable trading strategies is founded on the ideas of the Efficient Market Hypothesis (EMH). Since EMH implies that our search for consistently profitable trading strategies is futile, we first give an overview of EMH and then show the empirical results that contradict this theory.

EMH states that the current market price reflects the assimilation of all the information available [13]. That is, its proponents argue that since stocks always trade at their fair value on stock exchanges, it is impossible to outperform the overall market through expert stock selection or market timing. Any new information is quickly integrated into the market price. Fama formalized the concept of efficient markets in 1970 by expressing the non-predictability of market prices:

$$E(\tilde{p}_{j,t+1} \mid \Phi_t) = \bigl[1 + E(\tilde{r}_{j,t+1} \mid \Phi_t)\bigr]\, p_{j,t}$$

Where:

$p_{j,t}$ is the price of security $j$ at time $t$;

$\tilde{r}_{j,t+1}$ is the one-period percentage return; and

$\Phi_t$ is the information set reflected at time $t$.


Based on this expectation expression, Fama argues that there is no possibility of finding excess market returns via market timing based solely on the information in $\Phi_t$, hence dispelling the possibility of trading strategies based on technical indicators.

On the other hand, despite the theoretically sound nature of EMH, research over the last 30 years has shown that several assumptions made in EMH may be unrealistic. First, a fundamental assumption is that investors behave rationally, or that the deviations of the many irrational investors cancel out. However, some research has shown that investors are not strictly rational [41], or devoid of biases [20]. Indeed, people with a conservatism bias tend to underweight new information. Moreover, experiments have shown that these biases tend to be systematic and that deviations do not cancel each other out [21]. This leads to over- and under-reaction to news events.

From the 1990s, the literature has seen the growing decline of the EMH and the emergence of behavioral finance. Behavioral finance views the market as an aggregate of human actions filled with imperfect and inefficient decisions. Under this theory, the financial markets are a reflection of human desires, goals, motivations, errors and overconfidence [40]. An alternative to EMH that has gained traction is the Adaptive Market Hypothesis, which posits that profit opportunities from inefficiencies exist in financial markets but are eroded away as knowledge of the inefficiency spreads throughout the public and the public capitalizes on the opportunities. By this view of financial markets, many have built evolutionary and/or non-linear models and demonstrated that excess returns can be attained on out-of-sample data.


1.6 Previous Research

Because of their ability to model nonlinear relationships without pre-specification during the modeling process, neural networks (NNs) have become a popular method in financial time-series forecasting. NNs also offer great flexibility in the architecture of the model, in terms of the number of hidden nodes and layers. Indeed, Pekkaya and Hamzacebi compare the results of a linear regression versus a NN model to forecast macro variables and show that the NN gives much better results [35].

Many studies have used NNs and shown promising results in the financial markets. Grudnitski and Osburn implemented NNs to forecast S&P 500 and gold futures price directions and found they were able to correctly predict the direction of monthly price changes 75% and 61% of the time, respectively [15]. Another study showed that a NN-based model leads to higher arbitrage profits compared to cost-of-carry models. Phua, Ming and Lin implement a NN using Singapore's stock market index and show a forecasting accuracy of 81% [36]. Similarly, NN models applied to weekly forecasting of Germany's FAZ index find favorable predictive results compared to conventional statistical approaches [14].

More recently, NNs have been augmented or adapted to improve performance on financial time series forecasting. Shaoo et al. show that cascaded functional link artificial neural networks (CFLANN) perform best in FX markets [39]. Egrioglu et al. introduce a new method based on feed-forward artificial neural networks to analyze multivariate high-order fuzzy time series forecasting models [12]. Liao and Wang used a stochastic time-effective neural network model to show predictive results on global stock indices. Bildirici and Ersin combined NNs with ARCH/GARCH and other volatility-based models to produce a model that outperformed ANNs or GARCH-based models alone. Moreover, Yudong and Lenan used back-trial chemotaxis optimization (BCO) and a back-propagation NN on the S&P 500 index and conclude that their hybrid model (IBCO-BP) offers lower computational complexity, better prediction accuracy and less training time.

Another popular machine learning classification technique that does not require any domain knowledge or parameter setting is the decision tree. It also often offers a more visually interpretable model compared to a NN, as the nodes in the tree can be easily understood. The simplest type of decision tree model is the classification and regression tree (CART). Sorensen et al. show that CART decision trees perform better than single-factor models based on the same variables in picking stock portfolios [42]. Wang and Chan use a two-layer bias decision tree to predict the daily stock prices of Microsoft, Intel and IBM, finding excess returns compared to a buy-and-hold method [43]. Another study found that a boosted alternating decision tree with expert weighting generated abnormal returns for the S&P 500 index during the test period [11]. To improve accuracy, some studies used the random forest algorithm for classification, which will be further discussed in Chapter 4. Namely, Booth et al. show that a recency-weighted ensemble of random forests produced superior results when analyzed on a large sample of stocks from the DAX, in terms of both profitability and prediction accuracy, compared with other ensemble techniques [7]. Similarly, a gradient-boosted random forest model applied to Singapore's stock market was able to generate excess returns compared with a buy-and-hold strategy [37]. Some recent research combines decision tree analysis with evolutionary algorithms to allow the model to adapt to changing market conditions. Hsu et al. present constraint-based evolutionary classification trees (CECT) and show strong predictability of a company's financial performance [16].


Support Vector Machines (SVM) are also often used to predict market behavior. Huang et al. compare SVM with other classification methods (random walk, linear discriminant analysis, quadratic discriminant analysis and Elman backpropagation neural networks) and find that SVM performs best in forecasting weekly movements of the Nikkei 225 index [17]. Similarly, Kim compares SVM with NN and case-based reasoning (CBR) and finds that SVM outperforms both in forecasting the daily direction of change in the Korea Composite Stock Price Index (KOSPI) [23]. Likewise, Yang et al. use a margin-varying Support Vector Regression model and show empirical results with good predictive value for the Hang Seng Index [46]. Nair et al. propose a genetic-algorithm-optimized decision tree and support vector machine hybrid, validate its performance on the BSE-Sensex, and find that its predictive accuracy is better than that of both a NN and a Naive Bayes based model [31].

While some studies have tried to compare various machine learning algorithms against each other, the results have been inconsistent. Patel et al. compare four prediction models, NN, SVM, random forest and naive Bayes, and find that over a ten-year period on various indices the random forest model performed best. However, Ou and Wang examine the performance of ten machine learning classification techniques on the Hang Seng Index and found that the SVM outperformed the other models [33]. Kara et al. compared the performance of NN versus SVM on the daily Istanbul Stock Exchange National 100 Index and found that the average performance of the NN model (75.74%) was significantly better than that of the SVM model (71.52%) [22].

Machine learning research mostly focuses on predictive modeling. However, creating an agent in a dynamic environment that is able to learn and improve its policy during training requires another branch of machine learning, reinforcement learning, in which an agent is created to find optimal policies and maximize its reward. But that is a rather isolated way to think about the trading environment: what if there are other agents in the world? Indeed, evidence suggests that other agents do exist in the world alongside our agent. Thus, game theory, the mathematics of conflict between participants, is the missing piece needed to complete the model of the market. Eric Engle et al. [note] provided the theoretical ideas for combining game theory and machine learning in an agent-based approach to stocks, but lacked implementation results.

Chapter 3: Theoretical reviews

In the first part of this chapter, we lay out the foundations of game theory. At the beginning it formalizes the basic definitions, which are necessary to be able to speak correctly about games and game-plays. Subsequently it presents the standard representations of games. The background in game theory is essential for finding rational responses and also for general reasoning about games. The mathematical formalization of game theory in this chapter is inspired by [16]. In the later part of the chapter, we shall mention how game theory is applied to create a decision-making agent in the stock market environment, along with the difficulties of the traditional game theory approach and the need for a simulation approach and algorithms.

Game theory framework

Game theory is a branch of applied mathematics that studies strategic decision making. It uses mathematical models to formulate interactions between intelligent, rational decision-makers. These interactions are called games.


Games are played within a game environment (the difference between games and game environments is sometimes omitted, although it is useful to distinguish them) and are composed of a system of rules, which defines the players and the actions and postulates the dynamics of the game. The game is called a puzzle if there is no more than one agent involved; otherwise it is a conflict [18].

Definition 2.1 Player

A player (or an agent) is an entity able to act. His activities alter the world in which he exists.

The concept of a game consists of active and passive elements. Passive elements represent the information, i.e. which actions are feasible for a particular agent in a given state, or how the game will evolve under certain conditions and actions taken. Active elements in the game are the players. Without the players, the game remains static; only their actions can manipulate the game.

Definition 2.2 Action

An action (or a move) is a change in the game caused by a player in a particular situation.

A valid game environment enables all agents to act and to be immediately aware of their actions. Their activity can change the current situation as a consequence of their decision making.

Different situations which can occur before the game terminates are called states of the game. The game is played within a game environment.


Every game begins in a root state and then progresses according to the game dynamics, as participating agents make their decisions. All rational players select their actions to achieve their goals. The theory of utility was established to recognize the effects of their behavior and to evaluate the situations in which the agents are located. Utility is a value which measures the usefulness of the current state of the game for each player.

Definition 2.3 Utility

Let $S$ be a set with a weak ordering preference relation $\le$. A utility (or outcome) is a cardinal element $e \in S$, representing the motivation of players. The function $u$ is said to be a utility function iff $\forall x, y \in S: u(x) \le u(y) \Leftrightarrow x \le y$.

Altogether, a mathematical game is a structure which conclusively defines the whole game and its development.

Definition 2.4 Game

A game is a tuple $G = (P, A, u)$, where:

$P = \{p_1, p_2, \ldots, p_m\}$ is a set of players;

$A = \{A_1, A_2, \ldots, A_m\}$ is a set of sets of available actions, one for each player; and

$u$ is a utility function $u: A_1 \times A_2 \times \cdots \times A_m \to \mathbb{R}$.

This general definition of a game expects all players to act simultaneously in just one round, after which the game ends. Nevertheless, the end of a game in finite time is guaranteed only in the so-called finite games. It signifies that at some point they will terminate and the utilities are assigned. All finite games have starting and terminal states. In these games the number of players is finite, as well as the number of permitted actions for each player. An agent can face only finitely many situations in a finite game, and the game-play cannot go on indefinitely [19].

This approach is certainly rational enough in puzzles, where there is only one agent to set the course of the world. In contrast, in environments with a greater number of other players it is preferable to randomize over the set of pure strategies, following a selected probability distribution. Sometimes, rather than a strategy, randomizing the decisions can be seen as a belief of an agent that he can profit from playing such an action. This kind of strategy is called mixed. Playing a mixed strategy ensures that every agent can only guess what will happen; compared to pure strategies, the outcome is now less predictable.

Optimal strategy

The whole of game theory was originally established to solve a simple question: what is an optimal reaction? How should an agent react to be the most likely to win the game? The answer is that the fundamental advantage for a player is information about the strategies of his opponents. In other words, once an agent is able to guess the next action of any other agent, he can deliberately follow a strategy which maximizes his terminal utility. In conclusion, the set of all optimal strategies $\Sigma^*(p_i) \subseteq \Sigma(p_i)$ (meaning the strategies with the highest equal expected utility) of a rational, well-informed agent $p_i$ is then entirely determined by the strategies $\sigma(p_{-i}) \in \Sigma(p_{-i})$ of the others.

Definition Best response

An agent's strategy $\sigma^*(p_i)$ in game $G = (P, A, u)$ is a best response to the strategies $\sigma(p_{-i})$ iff $\forall \sigma(p_i) \in \Sigma(p_i): u_i(\sigma(p_i), \sigma(p_{-i})) \le u_i(\sigma^*(p_i), \sigma(p_{-i}))$.

Unfortunately, in most cases information about the opponents' strategies is out of reach, or obtaining it is impossible in terms of computational complexity. Another possibility is to estimate the strategies, e.g. from the previous actions of other players, and consecutively adjust one's own.

Definition 2.5 Nash equilibrium (NE)

Given a game $G = (P, A, u)$ and strategies $\sigma \in \Sigma$, the players $P$ are in Nash equilibrium iff for every player $p_i$ the strategy $\sigma(p_i)$ is a best response to the strategies $\sigma(p_{-i})$ of the others.

If the state of the world allows no one to benefit from changing his strategy, the situation remains stable. It has been proved that in every game with finitely many players and a finite set of pure strategies there is at least one Nash equilibrium profile, although it might consist of mixed strategies [22].

Game representations

There are a number of different representations of games. The simplest one was presented at the beginning of this part. Although the general definition is sufficient for the mathematical apparatus, for concrete game examples it is more convenient to establish standard forms and structures for working with the game data. Different representations extend the general definition, thus allowing various games to express their specific aspects in a more suitable form. Algorithms for finding Nash equilibria can be adapted to a particular representation to reduce computational complexity. There exist several representations of games, taking into account stochasticity, the number of players and decision points, the possibility of cooperation and other important characteristics of the game.

Normal form

Normal (or strategic) form is the basic type of game representation, $G = (P, A, u)$. Each player moves once and actions are chosen simultaneously. This makes the model simpler than other forms and easier to solve for Nash equilibrium, but it lacks any temporal locality.

The most famous representative of a normal-form game is the Prisoner's Dilemma, which is described as follows:

Two members of a criminal gang are arrested and imprisoned. Each prisoner is in solitary confinement with no means of communicating with the other. The prosecutors lack sufficient evidence to convict the pair on the principal charge. They hope to get both sentenced to a year in prison on a lesser charge. Simultaneously, the prosecutors offer each prisoner a bargain. Each prisoner is given the opportunity either to betray the other by testifying that the other committed the crime, or to cooperate with the other by remaining silent. The offer is:

If X and Y each betray the other, each of them serves 5 years in prison

If X betrays Y but Y remains silent, X will be set free and Y will serve 20 years in prison (and vice versa)


If X and Y both remain silent, both of them will only serve 1 year in prison (on the lesser charge)

An example of the Prisoner's Dilemma game

From that example we can observe that both players betraying each other is the Nash equilibrium of this game, because neither player has an incentive to unilaterally change his choice.
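As an illustrative check (not part of the thesis), the payoff matrix from the story above can be written down and the pure-strategy Nash equilibrium found by testing best responses; sentences are measured in years of prison and are therefore minimized.

```python
# Sketch: verify the pure-strategy Nash equilibrium of the Prisoner's Dilemma
# by checking best responses. Payoffs are years in prison, using the numbers
# from the text: 5/5, 0/20, 20/0, 1/1.
from itertools import product

ACTIONS = ["betray", "silent"]
YEARS = {  # (years for X, years for Y)
    ("betray", "betray"): (5, 5),
    ("betray", "silent"): (0, 20),
    ("silent", "betray"): (20, 0),
    ("silent", "silent"): (1, 1),
}

def is_nash(ax, ay):
    yx, yy = YEARS[(ax, ay)]
    # No unilateral deviation may strictly reduce a player's own sentence.
    best_x = all(YEARS[(dx, ay)][0] >= yx for dx in ACTIONS)
    best_y = all(YEARS[(ax, dy)][1] >= yy for dy in ACTIONS)
    return best_x and best_y

print([profile for profile in product(ACTIONS, ACTIONS) if is_nash(*profile)])
# -> [('betray', 'betray')]
```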

Extensive form

The extensive form models multi-agent sequential decision making. A convenient representation of an extensive-form game is a game tree. Such a structure allows one to express even complicated branching of the game, restricting actions in different game states to the feasible ones only.

Definition 2.6 Game tree

Every game tree is a tuple $T = (S, Z, A, e, f, r)$, where:

$S$ is a set of game states;

$Z \subseteq S$ is the subset of terminal states;

$A$ is a set of game actions;

$e$ is an expander function, $e: s \in S \to \{a \in A \mid a \text{ is executable in } s\}$;

$f$ is a successor function, $f: (s \in S \times a \in e(s)) \to t \in S$; and

$r \in S$ is the root state.
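A minimal sketch of how the tuple from Definition 2.6 could be held in code; the dictionary-based expander and successor functions are an assumption made purely for illustration.

```python
# Sketch of the game-tree tuple T = (S, Z, A, e, f, r) from Definition 2.6,
# with the expander e and successor f given as plain dictionaries.
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action = str, str

@dataclass
class GameTree:
    states: List[State]                           # S
    terminals: List[State]                        # Z, a subset of S
    actions: List[Action]                         # A
    expander: Dict[State, List[Action]]           # e(s): actions executable in s
    successor: Dict[Tuple[State, Action], State]  # f(s, a): resulting state
    root: State                                   # r

    def play(self, moves: List[Action]) -> State:
        """Follow a sequence of moves from the root and return the reached state."""
        s = self.root
        for a in moves:
            assert a in self.expander.get(s, []), f"{a} not executable in {s}"
            s = self.successor[(s, a)]
        return s
```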

Using the notion of a game tree, it is now possible to define an extensive-form game. This representation consists of a game tree together with a set of players, who are assigned to the states of the tree, and a utility function, which determines the utility in every terminal state, i.e. in every leaf of the game tree.

Definition 2.7 Extensive-form games

A game in extensive form is a tuple $G = (P, T, b, u)$, where:

$P$ is a set of players;

$T$ is a game tree;

$b$ assigns a player from $P$ to each non-terminal state of the tree; and

$u$ is a utility function defined on the terminal states of the tree.

In the example of matching pennies in extensive form, the second player can always make her choice dependent on the first player's choice: if the first player selects Head, she will select Tail, and if the first player selects Tail, she will select Head. Paired with either of the two pure strategies of the first player, this gives a Nash equilibrium in pure strategies.

An example of extensive-form game – Matching pennies
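The claim above can be checked with a tiny backward-induction sketch; it is purely illustrative and assumes the usual zero-sum payoffs in which player 1 wins when the pennies match.

```python
# Sketch: matching pennies in extensive form, with the second player observing
# the first player's coin. Backward induction shows the second player always
# mismatches, so player 1's payoff is -1 whichever pure strategy she uses.
P1_UTILITY = {  # +1 if the pennies match (player 1 wins), -1 otherwise
    ("H", "H"): 1, ("H", "T"): -1,
    ("T", "H"): -1, ("T", "T"): 1,
}

def best_reply_p2(p1_move):
    # Player 2 minimizes player 1's utility (zero-sum game).
    return min("HT", key=lambda m: P1_UTILITY[(p1_move, m)])

for first in "HT":
    second = best_reply_p2(first)
    print(first, "->", second, "payoff to player 1:", P1_UTILITY[(first, second)])
# H -> T payoff to player 1: -1
# T -> H payoff to player 1: -1
```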

Stochastic games (Markov Games)

Arguably, most, if not all, real-world systems are influenced by events of a probabilistic nature. Shapley (1953) was the first to define a game model that incorporates probabilistic choices.

Definition 2.8 Stochastic games

According to Shapley, a stochastic game is a tuple $G = (S, A, T, R, \gamma, N)$, where:

$S$ is the set of states of the game;

$A_i$ is the set of available actions for player $i$ ($A_i \in A$), and $A$ is the collection of the players' action sets;

$T$ is the transition function $T(s, (a_i, a_j), s')$: at state $s$, if player $i$ chooses action $a_i$ and the others simultaneously choose theirs, it gives the probability of reaching the next state $s'$;

$R$ is the reward received by the players for the chosen actions, $R = (R_i(a_i, a_j), R_j(a_i, a_j))$;

$\gamma$ is the discount factor; and

$N$ is the number of players.
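A minimal sketch of how such a tuple might be written down for a toy two-player market game; the "up"/"down" states, the probabilities and the rewards are invented placeholders, not data from the thesis.

```python
# Sketch of the stochastic-game tuple G = (S, A, T, R, gamma, N) from
# Definition 2.8 for a toy two-player market game.
states = ["up", "down"]                                   # S
actions = {1: ["long", "short"], 2: ["long", "short"]}    # A_i per player
gamma, n_players = 0.95, 2                                # gamma, N

# T(s, (a_i, a_j), s'): probability of reaching s' from s under the joint action.
T = {
    ("up", ("long", "short"), "up"): 0.6,
    ("up", ("long", "short"), "down"): 0.4,
    # ... remaining joint actions / states would be filled in the same way
}

# R(s, (a_i, a_j)) -> (reward to player 1, reward to player 2)
R = {
    ("up", ("long", "short")): (1.0, -1.0),
    # ...
}

print(T[("up", ("long", "short"), "up")], R[("up", ("long", "short"))])
```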

Shapley games are played by a finite number of players on a finite state space; in each state, each player chooses one of finitely many actions, and the resulting profile of actions determines a reward for each player and a probability distribution on successor states.

In principle, a stochastic game proceeds ad infinitum. The payoff that each player receives is given by a function of the infinite stream of rewards for this player: Shapley considered games where payoffs are the discounted sum of rewards; other popular payoff functions are the limit average of the rewards or the total sum of the rewards, as discussed by Filar & Vrieze in 1997.

A pure strategy in a stochastic game assigns an action to each possible sequence of states visited so far, whereas a randomized strategy assigns a probability distribution on actions to each such sequence. Hence every player has infinitely many strategies at his command, and Nash's equilibrium theorem for finite games is not directly applicable. Nevertheless, in the case of discounted payoffs there always exists a Nash equilibrium in randomized strategies. There is even a Nash equilibrium where strategies only depend on the current state and not on the full history of visited states; we call such strategies stationary. For the general-sum game, Nash equilibria need not exist.


Thus, how could the stochastic game be applied to our research in order to create an agent that can make decisions without human supervision? In principle, the stock market is a stochastic game between our agent and other self-interested agents; they can cooperate or compete with each other in order to gain the optimal reward. However, the practical problem is that it is impossible to know all the information about the other agents' decisions and states. Therefore, in the context of this thesis, we describe the stock market game as a two-player stochastic game: all the interaction of the other agents with our agent's actions shall be reflected through the market movement (nature). As can be seen, it would be easy to apply the stochastic game directly to the stock market, where our agent chooses an action based on the current state, estimates the next available states and rewards, then chooses the best response at the current state. However, it is impossible to predetermine all states and available next states, along with the rewards from taking actions, because of the complex nature of the market. Fortunately, another research field holds the key to solving our problem: simulation and the computer science approach in the form of machine learning.

Simulation

In the following parts, we shall mention some key concepts of simulation and machine learning, to provide more insight into how they could be the solution to the problems of the traditional stochastic game.

Simulation

Simulation methods are ways to imitate the operation of real-world systems. They first require that a model be developed representing the characteristics, behaviors and functions of the selected system or process. The model represents the system itself, whereas the simulation represents the operation of the system over time.


These methods are widely used in economics, biology, engineering and almost all sciences. Simulation is usually done using computers, making changes to variables and performing predictions about the behavior of the system. Good examples of the usefulness of computer simulation can be found in automobile traffic simulation, grocery store checkout lines, inventory management, stock price prediction, environmental consequences of policies, and so on.

Key issues in simulation include the acquisition of valid source information about the relevant selection of key characteristics and behaviors, the use of simplifying approximations and assumptions within the simulation, and the fidelity and validity of the simulation outcomes. Procedures and protocols for model verification and validation are an ongoing field of academic study, refinement, research and development in simulation technology and practice, particularly in the field of computer simulation.

The simulation procedure


Machine learning

Machine learning is a field of computer science that often uses statistical techniques to give computers the ability to "learn" (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed [Samuel, Arthur (1959). "Some Studies in Machine Learning Using the Game of Checkers". IBM Journal of Research and Development].

Analysts like to talk about the models they build in terms of the problem that they solve. A model is a process that takes in observations and then provides predictions. Many models have been built based on the simulation-modeling approach, for example the famous Black-Scholes model that predicts option prices. Those models are developed using mathematical formulas.

However, to deal with the problem of building an agent that can learn and adapt to the environment, we need a simulation approach in the form of machine learning. With machine learning, we do not use direct observations as in modeling; we try to use data. The machine learning process is to take historical data and run it through a machine learning algorithm to generate the model. The model is not built by a human but by the machine itself. Then, when we need to use the model, we just provide some input and the output comes out automatically.
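A minimal, purely illustrative sketch of this data-in, model-out process (not from the thesis); the random placeholder data and the choice of a random-forest classifier are assumptions made for the example.

```python
# Sketch of the "data in, model out" process described above, using a
# random-forest classifier from Scikit-learn on placeholder feature data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_hist = rng.normal(size=(500, 3))            # historical features (placeholder)
y_hist = (rng.random(500) > 0.5).astype(int)  # 1 = price went up, 0 = down (placeholder)

model = RandomForestClassifier(n_estimators=100).fit(X_hist, y_hist)  # the machine builds the model
x_new = rng.normal(size=(1, 3))               # new observation
print(model.predict(x_new))                   # the output comes out automatically
```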

Application to stock data

The application of the machine learning approach to stock data is quite straightforward; the following figure describes how it works with historical stock data. The historical data represent the values of the features for a particular stock over the time horizon, and we represent those features by stacking them one behind the other. We use machine learning algorithms to train our agent based on those features and the historical price.

Figure: historical data and the feature set (x): P/E, Bollinger band, moving average.
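As an illustrative sketch only (the window length, the band width and the placeholder price series are assumptions, and the P/E ratio would require separate fundamentals data), such features could be assembled with pandas as follows.

```python
# Sketch: building a feature matrix (moving average, Bollinger bands) from a
# historical price series; P/E is left as a placeholder column.
import numpy as np
import pandas as pd

prices = pd.Series(np.random.default_rng(1).normal(100, 2, 250), name="close")

window = 20
ma = prices.rolling(window).mean()                    # moving average
std = prices.rolling(window).std()
features = pd.DataFrame({
    "moving_average": ma,
    "bollinger_upper": ma + 2 * std,                  # Bollinger bands
    "bollinger_lower": ma - 2 * std,
    "pe_ratio": np.nan,                               # placeholder: needs fundamentals data
}).dropna(how="all")
print(features.tail())
```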


The trading agent can be conveniently modeled in the framework of reinforcement learning, as mentioned above. This framework adjusts the parameters of an agent to maximize the expected payoff or reward generated by its actions. Therefore, the agent learns a policy that tells it the actions it must perform to achieve its best performance. This optimal policy is exactly what we hope to find when we are building an automated trading strategy.

To solve the stochastic game for our agent, Markov decision processes (MDPs) are the most common model when implementing reinforcement learning; an MDP can be considered a narrowed-down model of a stochastic game. The MDP model of the environment consists, among other things, of a discrete set of states $S$ and a discrete set of actions taken from $A$. In this project, we only specify the action set of our agent, because we assume that the other agents' actions will be reflected in the price movement of the stock. Depending on the position of the learner (long or short), at each time step $t$ it is allowed to choose an action $a_t$ from different subsets of the action space $A$, which consists of three possible actions:

$a_t \in \{\text{None}, \text{Long}, \text{Short}\}$

Where:

None indicates that the agent shouldn't have any order in the market

Long and Short mean that the agent should execute a market order to buy or sell 100 shares (the size of an order will always be one hundred shares).
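A minimal sketch of this action space follows; the exact position-dependent subsets are an assumption made for illustration, not a specification from the thesis.

```python
# Sketch of the action space A = {None, Long, Short} and the fixed order size.
from enum import Enum

class TradeAction(Enum):
    NONE = "none"    # hold no order in the market
    LONG = "long"    # market order to buy 100 shares
    SHORT = "short"  # market order to sell 100 shares

ORDER_SIZE = 100     # every order is exactly one hundred shares

def allowed_actions(position):
    """Hypothetical subset of A the learner may choose from, given its position."""
    if position == 0:                       # flat: may open either side or wait
        return {TradeAction.NONE, TradeAction.LONG, TradeAction.SHORT}
    if position > 0:                        # already long: may hold or sell
        return {TradeAction.NONE, TradeAction.SHORT}
    return {TradeAction.NONE, TradeAction.LONG}
```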

So, at each discrete time step $t$, the agent senses the current state $s_t$ and chooses to take an action $a_t$. The environment responds by providing the agent a reward $r_t = r(s_t, a_t)$ and by producing the succeeding state $s_{t+1} = \delta(s_t, a_t)$. The functions $r$ and $\delta$ depend only on the current state and action (they are memoryless), are part of the environment, and are not necessarily known to the agent.

The task of the agent is to learn a policy $\pi$ that maps each state to an action ($\pi: S \to A$), selecting its next action $a_t$ based solely on the current observed state $s_t$, that is, $\pi(s_t) = a_t$. The optimal policy, or control strategy, is the one that produces the greatest possible cumulative reward over time. So, stating that:

$$V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

where $V^{\pi}(s_t)$ is also called the discounted cumulative reward; it represents the cumulative value achieved by following a policy $\pi$ from an initial state $s_t$, and $\gamma \in [0,1]$ is a constant that determines the relative value of delayed versus immediate rewards.
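As a quick illustration (not from the thesis), the discounted cumulative reward above can be computed for a finite sample of rewards as follows.

```python
# Sketch: discounted cumulative reward V^pi(s_t) = sum_i gamma^i * r_{t+i}
# for a finite sample of rewards.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.81*2 = 2.62
```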

If we set $\gamma = 0$, only immediate rewards are considered. As $\gamma \to 1$, future rewards are given greater emphasis relative to the immediate reward. The optimal policy $\pi^*$ is the one that maximizes $V^{\pi}$, i.e. $\pi^* = \arg\max_{\pi} V^{\pi}(s)$ for all states $s$. So, as we are trying to maximize the cumulative reward $V^*(s_t)$ for all states $s$, while the agent does not know the reward and transition functions in all states and is not able to perfectly predict the immediate reward and immediate successor for every possible state-action transition, we must learn $V^*$ indirectly.

To solve this, we define a function $Q(s, a)$ such that its value is the maximum discounted cumulative reward that can be achieved starting from state $s$ and applying action $a$ as the first action. So, we can write:

$$Q(s, a) = r(s, a) + \gamma V^*(\delta(s, a)), \qquad \pi^*(s) = \arg\max_{a} Q(s, a)$$

This implies that the optimal policy can be obtained even if the agent just uses the current action $a$ and state $s$ and chooses the action that maximizes $Q(s, a)$. It is also important to notice that the expression above implies that the agent can select optimal actions even when it has no knowledge of the functions $r$ and $\delta$.

Lastly, there are some conditions to ensure that reinforcement learning converges toward an optimal policy. On a deterministic MDP, the agent must select actions in such a way that it visits every possible state-action pair infinitely often. This requirement can be a problem in the environment in which the agent will operate.
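As a minimal illustrative sketch only (the `env.reset`/`env.step` interface, the hyperparameters and the epsilon-greedy exploration are assumptions, not the thesis's implementation), tabular Q-learning with the exploration needed to keep visiting state-action pairs could look like this.

```python
# Sketch: tabular Q-learning with an epsilon-greedy policy, so that (with
# enough exploration) every state-action pair keeps being visited.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)                      # Q[(state, action)]
    for _ in range(episodes):
        s, done = env.reset(), False            # hypothetical env API (reset/step)
        while not done:
            if random.random() < eps:           # exploration keeps visiting all pairs
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            target = r + gamma * max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Q-learning update
            s = s_next
    return Q
```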
