On the problem of synthesizing gestures

for 3D virtual human

Le Kim Thu

Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Assoc. Prof. Bui The Duy

A thesis submitted in fulfillment of the requirements for the degree of

Master of Information Technology

June 2010


Originality Statement

I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.


ACKNOWLEDGEMENTS

All of my teachers in the Master course, whose lectures gave me a solid foundation to work on my problem.

My colleagues in the Faculty of Applied Mathematics and Informatics, Hanoi University of Technology, who encouraged me to join this course, read my thesis, pointed out my deficiencies and helped me fix them.

My friends, who commented on my writing style and the structure of my project.

Last but not least, my family, especially my mother, for their encouragement, care and love. They are the people who always trust in me and stand by me in all difficult situations. This understanding is the most priceless present I have ever had.

Hanoi, June 2010

Le Kim Thu


ABSTRACT

Autonomous learning is one of the main reasons for human development. Many computer systems also need this property to improve their performance. Nevertheless, it is very difficult to construct such a system, and the reinforcement learning framework is a prominent solution to this problem. In this thesis, we construct a self-learning system that uses a reinforcement learning algorithm, Q-learning. In addition, we also propose a new method for controlling action choice based on supervised learning. Our experiments show that, apart from some bad cases, this strategy provides quite good results. This encourages us to extend the approach to more complex problems.


CONTENTS

CONTENTS 5

List of figures 6

INTRODUCTION 7

CHAPTER I: BACKGROUND 9

1.1 Gesture synthesis 9

1.2 Reinforcement learning 11

1.2.1 Model description 11

1.2.2 Solutions 13

1.3 Exploitation versus exploration 15

1.4 Exploration strategies 16

1.4.1 Predefined trajectory strategy 16

1.4.2 Next-Best-View 17

CHAPTER II: A MODEL FOR LEARNING PICKING UP PROBLEM 20

2.1 Problem representation 20

2.2 Model identification 21

2.3 Proposed exploration strategy 24

CHAPTER III: IMPLEMENTATION, EXPERIMENT AND EVALUATION 27

3.1 Overview 27

3.2 Experiment and result 30

CONCLUSION 33

References 34


List of figures

Figure 1.1: Data collection phase [41] 6

Figure 1.2: Data annotation 6

Figure 1.3: (a) Combine discrete segment and manual adjustment (b) Hole filling and detailed surface model 7

Figure 1.4: Reference skeleton joint transformation 7

Figure 1.5: An example of folding arm 8

Figure 1.6: Interaction in a reinforcement learning 9

Figure 1.7: Algorithm Q-learning for estimating Q(s, a) 11

Figure 1.8: Average performance of ε-greedy action-value methods [10] 15

Figure 2.1: Angles between an element and its parent 18

Figure 2.2: Shapes for fingers 19

Figure 3.1: An example of the learning system 25

Figure 3.2: Total time to build Q-table1 26

Figure 3.3: Average, min and max number of states in 25 runs 26

Figure 3.4: Number of steps before completing the task 27

Figure 3.5: Distances between the agent and the object in the worst case of new strategy 27

Figure 3.6: Distances between the agent and the object after each movement 28


INTRODUCTION

Simulation and modeling are attracting a lot of attention due to their wide range of applications in real life. For instance, in biology, using those technologies we can simulate interactions of cells, DNA sequences, or the infection process of a virus from infected cells to normal cells [36]. Many other simulations, such as test cases in construction and design, have helped save a lot of manpower and money [45]. Those technologies also enable us to practice many situations until we can handle them in reality [46]. Using this practice method helps increase driving skill, for example, without causing any damage.

Among the great number of simulations, human simulation is one of the areas with the largest number of real-life applications. Typical applications are entertainment, interactive systems, education, e-commerce, health, etc. Modeling humans is a very difficult task because of the complexity of human behaviors. For example, when we look at children, we can see them trying to do one thing many times in order to do it well. Activities which are very simple to us, such as sitting, standing and moving around, are really difficult for them. Therefore, they have to undergo many trial-and-error sequences before identifying what they should do. Similarly, robots or embodied agents can also learn to plan their movement. In this thesis, we focus on gesture synthesis with a machine learning approach. In particular, we use reinforcement learning for an embodied agent to learn to control its hand to raise an object.

In a reinforcement learning process, at the beginning, the agent has no idea of the environment and what it should do to reach its goal. The agent performs some actions and observes the resulting information. Based on that information, the agent learns how to improve its performance. The difficulty is that the agent knows only parts of the environment, so it cannot guarantee that the information it receives is good. This issue leads to the problem of trading off exploration against exploitation. To address this problem, we also develop an exploration algorithm based on supervised learning.

The thesis is structured as follows. Chapter 1 presents the background of gesture synthesis and reinforcement learning; it also presents in detail the exploration and exploitation trade-off problem. Our problem and a proposed strategy to explore the environment are described in detail in Chapter 2. Chapter 3 presents the results of our implementation, the comparison between the two methods, and illustrations of them.


CHAPTER I: BACKGROUND

This chapter contains two parts: the first presents some techniques to synthesize gestures of a virtual human, and the second presents the model of the RL problem and illustrates the exploration and exploitation trade-off during the learning process.

1.1 Gesture synthesis

Gesture synthesis is a problem with a wide range of applications, and a lot of algorithms have been proposed for it [37, 38…]. Typical methods include inverse kinematics [37, 44], using motion capture data [38, 41], and considering a hierarchical frame [39, 42].

Motion capture is an effective class of algorithms. The typical pipeline of this method is as follows: first, data is collected; this data is annotated manually to construct a database. Then a deformation technique, for example EFFD or NFFD [40], is used to generate the unit models, and other techniques can be used to smooth the model. Lastly, the gesture is combined from these unit models. Figure 1.1 presents an example of this approach.

Figure 1.1 Data collection phase [41]

Figure 1.2: Data annotation


Figure 1.3: (a) Combine discrete segment and manual adjustment; (b) Hole filling and detailed surface model

The results of this approach are quite good; however, it is highly dependent on the motion data that can be collected. In addition, few systems could generate varied action styles.

Another approach uses a frame for each part of the human body to control the movement. Algorithms in this approach include vertex blending, bone blending [39] and volume-preserving deformation [43]. The main advantage of this approach is the ability to represent a wide range of actions in a system without requiring too much data.

In our project, we rely on a hierarchical structure to control the model, so we will examine an algorithm to grasp the idea of the approach. In bone-blending algorithms, the authors build a skeleton frame of joints and bones, and the skin is a set of triangles, each attached to bones or joints. Changing the skeleton makes the joints and bones move, and these movements spread to the skin.

Example 1:

The wrist of the arm is considered a joint of two bones (Fig. 1.5a). The skin is determined by vertices on the triangle mesh; the different colors of the vertices refer to the different bones they depend on. An action corresponds to a shape of the skeleton; for example, if we bend the arm, the angle between the two bones changes (Fig. 1.5b). This change also affects the skin.

Figure 1.4: Reference skeleton joint transformation


Figure 1.5: An example of folding arm

Many other methods in this approach add a weight to each joint to manage the effect of the bones on a vertex of the skin mesh, or divide the skeleton into small parts; the changes are controlled within these parts before being combined with the others. In general, this approach is easier than the motion-data approach to implement on a small scale. It can also generate actions of very high quality; however, high quality requires a huge amount of computation that cannot be performed on a normal computer system [47]. Because of the complexity of synthesizing gestures, we only use a simple method of the second approach; the shape of the frame is controlled by the reinforcement learning process.
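To make the weighting idea concrete, here is a minimal sketch of linear blend skinning in 2D; the bone transforms, the 50/50 weights and the function name are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def blend_vertex(vertex, bone_transforms, weights):
    """Skin a vertex as the weighted sum of where each bone's transform would move it."""
    blended = np.zeros(2)
    for T, w in zip(bone_transforms, weights):
        # T is a 2x3 affine matrix: a 2x2 rotation part plus a translation column
        blended += w * (T[:, :2] @ vertex + T[:, 2])
    return blended

# Example: a vertex near the wrist influenced equally by the upper arm (fixed)
# and the forearm (rotated 30 degrees about the joint).
theta = np.radians(30)
T_upper = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
T_fore = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0]])
vertex = np.array([1.0, 0.0])
print(blend_vertex(vertex, [T_upper, T_fore], weights=[0.5, 0.5]))
```

A vertex attached to only one bone, as in the colored mesh of Figure 1.5, simply gets weight 1 for that bone and 0 for the others.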

1.2 Reinforcement learning

Reinforcement learning (RL) is an interesting problem that concerns many scientists in the field of artificial intelligence. Though the idea of the problem appeared a long time ago, around the 1950s, under the form of "optimal control" [23, 24], for a long time it did not attract much attention. Most modern RL algorithms started to appear in the 1980s [17, 18]. In this part, we consider problems connected with it.

1.2.1 Model description

Reinforcement learning, supervised learning and unsupervised learning are the three main problems in machine learning. Different from the other problems, where the training data is essentially fixed, RL has a value called the reward for estimating how well an action did. From its own experience, the agent adjusts its behavior to gain the highest reward (Fig. 1.6).


Figure 1.6: Interaction in a reinforcement learning

Formally, the model consists of:

- S: Set of environment states

- r: Scalar reward value

- P: Probability transition model (optional)

States of the environment: in this problem we consider only discrete environments. This means that the environment can be divided into separate states; however, the algorithm can be extended to apply to continuous environments. Among the many types of environments, we only consider environments that have the Markov property. This means that the states of such environments represent all the relevant information from the past, so the probability of changing from one state to another depends only on the current state, not on previous states. This type of environment is called a Markov decision process (MDP).
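In symbols, the Markov property says that the next-state distribution depends only on the current state and action:

$$\Pr(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = \Pr(s_{t+1} = s' \mid s_t, a_t)$$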

Reward: the scalar value calculated by the environment and sent back to the agent. It is a function of the state and the action the agent executes at that time.

Notice that the reward is computed by the environment, and reward values only evaluate the action taken in a state; they do not indicate the way to find the solution (see the example below).


The agent's job is to find a policy π, which defines the agent's behavior at a given time, so as to maximize its total reward.

Example 2:

A very well-known example is the game tic-tac-toe. In this game, players use a 3x3 board and the symbols X and O to mark cells one by one. If a player places three of his symbols in a row, a column or a diagonal, he wins the game. The elements of the problem are:

- State: in this situation the environment state is the configuration of the board; each cell can be blank or hold an X or an O.

- Action: in each state a player takes an action, placing his symbol on a blank cell.

- Reward: value 1 if the player wins, -1 if the player loses and 0 otherwise. Remember that the action of placing a symbol that makes a pair in a line may lead to winning the game but still receives reward 0.

- Return: as the goal focuses on the long-term reward, not the immediate reward, we use not only the immediate reward but also the reward values received afterward. Since better actions should happen early, we multiply the rewards by a discount factor γ when accumulating them, to reflect the importance of good actions happening first. The accumulated value is usually called the return and is determined by the formula:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \qquad (1.1)$$

The goal of the program is to find a policy π, a strategy to choose actions in each state, that guarantees the agent will win the game.
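As a rough numerical illustration of the return in (1.1), the short sketch below accumulates a hypothetical reward sequence with a discount factor of 0.9; the reward values are made up for illustration only.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A tic-tac-toe episode where only the final, winning move is rewarded:
print(discounted_return([0, 0, 0, 1]))  # 0.9**3 * 1 = 0.729
```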

1.2.2 Solutions

Most solutions for RL are based on some kind of value function to decide which actions should be taken or which states the agent should reach. The most typical value functions are the state value and the state-action value: the first evaluates how good it is for the agent to be in a state, while the second estimates the utility of taking an action in a state. When the agent is in a state of the environment, according to the value function it should take the action that moves it to a new state with a higher value. Typical classes of solution methods, besides dynamic programming (DP), include:

- Monte Carlo [1, 10…]: model-free. This method uses data as episodes with a terminal state. MC methods update the value function similarly to DP methods, but they store whole sequences of transitions up to a terminal state, so the value estimate of a state is clear and independent.

- Temporal difference [6, 8, 26…]: model-free. This method combines ideas from dynamic programming and Monte Carlo: it does not require a model and it updates estimates using only part of the learned knowledge.

One of the most well-known algorithms is the Q-learning algorithm proposed by Watkins in [26]. The Q-learning algorithm uses state-action values to support the decision process. At the beginning, each state-action pair is initialized with an arbitrary value; then, as the agent tries actions, the returned rewards are used to update these values. The algorithm is presented in Figure 1.7:

Figure 1.7: Algorithm Q-learning for estimating Q(s, a)
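As a companion to Figure 1.7, the following is a minimal tabular Q-learning sketch following the standard algorithm described in [10, 26]; the environment interface (reset, step, actions) is a hypothetical placeholder, not the interface used in the thesis.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)  # Q[(state, action)], initialized arbitrarily (here to 0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Choose a in s using an epsilon-greedy policy derived from Q
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Move Q(s, a) toward r + gamma * max_a' Q(s', a')
            best_next = max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```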

The algorithm is quite easy to understand, but how should we choose the action a from the action set at each state? The chosen action has to be better than the others in some way. How the agent uses its accumulated information to make such a decision is a big issue that will be discussed in the next section.

1.3 Exploitation versus exploration

We have seen that the agent has to try many times to observe information from the environment. When the agent is in a state, it clearly should take the action with the maximum utility (e.g. maximum reward), so it will use its beliefs to choose the best action to perform.

Example 3.a:

Assume that the environment is in the state shown in the board, and the agent plays symbol X. The best action the agent can perform is to place its symbol on cell (3, 3); when the agent completes this action, it receives reward 1 because it wins the game.

This is an example of the exploitation process, in which the agent chooses actions based on its knowledge of the environment. If the agent has enough information about the environment, it may determine the action that will return a good result. However, in many cases, e.g. at the beginning of the learning process, the agent has almost no idea about the environment or its beliefs are inadequate. Therefore, the chosen action may not be the best action at that state (see Example 3.b).

Example 3.b:

We consider the same state of the environment as in Example 3.a, but the agent has tried to place its symbol in only three cells: (2, 1), (3, 1) and (3, 3). Based on the utilities of these three candidates, the agent will place its symbol in one of these three cells.

In such a situation, if the agent only uses the available information, it may not find the optimal action. So it should try another action in the action set to explore a new one; this is the exploration process.

Both processes are important, but they contradict each other in execution, so we must build a strategy to balance them. The strategy decides the importance of each process in a state and how to take the candidate action that may maximize the agent's long-term reward.



Many strategies have been proposed to balance the two processes, for instance the greedy strategy or strategies based on agent behaviors [25, 29…]. The next part presents some typical strategies for balancing the two processes.

1.4 Exploration strategies

As mentioned in the previous section, a strategy to explore the environment plays an important role in the learning process. When an agent chooses an action to perform, it often takes the action whose result is better than that of the other actions. However, nothing guarantees that the agent is completely right, so sometimes it tries another action.

There are a lot of exploration strategies, and they can be divided into two main categories. The first is the predefined-trajectory approach: the agent collects and processes all information about the environment to build a model of the environment and determine a trajectory. The second is called the next-best-view approach: the agent accumulates environment information and chooses the next best action in a state to execute. We study these strategies in detail below.

1.4.1 Predefined trajectory strategy

This approach plays a very important role in robot control. Many robots always interact in a small region, and their environment does not change much over time.

On the other hand, it is not applied much in our reinforcement learning setting, hence we only take a glance at this approach. Among the large number of methods that have been proposed [29, 33…], the most popular idea is constructing a map to determine the robot's trajectory.

One typical method is collecting images of the environment to synthesize the agent's beliefs. For example, in [32] the robot takes images of the surrounding environment at different angles, then analyzes the density and orientation in the 2D images to identify frontiers and key points and to build a complete model of the environment. The model is then used to determine an optimal trajectory for the robot to act.

Although the method has gained very good results, it encounters an issue of scale. If the region in which the agent interacts is large, as in [33], it is very difficult to collect all the information. In addition, the robot may not be able to afford the processing. Consequently, a new method that computes efficiently is needed. In [33], the authors develop a new method in which the environment is divided into sub-parts called local regions. The robot then constructs a map for each local region and finds the optimal trajectories within that region. The robot follows the trajectory defined in the current region until it comes to another one.

Experimental results over the two methods point out that the second method greatly decreases the requirements for implementation while its results remain very acceptable; that is, the gap between the two methods is not large, although the first guarantees to minimize the error over the global environment while the other does not. Many scientists, as a whole, have paid more and more attention to this direction.

1.4.2 Next-Best-View

Strategies of this type of exploration use information about the candidate actions at a time to choose the action which is the best in some aspect. In general, the action choice is checked in a restricted space and is based on utility values accumulated by the agent during the learning process. There are a lot of strategies of this type, for example ε-greedy [26], softmax [25], using GA [30], frontier-based [31] and behavior-based [28, 29]. In this part, we focus on the greedy idea as an example of this trend of exploring the environment.

When an agent chooses the action with the biggest utility, it is exploiting the environment based on its knowledge. This is called the greedy method. Since the agent should also learn more about the environment, the simplest idea we can think of is occasionally choosing an action randomly, as in the ε-greedy method:

At a state s of the environment, the agent chooses an action to execute as follows:

o With probability (1-ε): choose the action with maximum utility.

o With probability ε: choose an action randomly among all candidates.

By using this method, it is guaranteed that as the number of trials goes to infinity, the number of times each action is chosen also goes to infinity, and consequently the utility estimate of each action converges to its true value. Therefore, at the end of the learning process, the agent will be able to choose the best action.
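As a small illustration of the rule above, an ε-greedy choice can be sketched as follows; the `q_values` mapping and the action list are hypothetical placeholders.

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """With probability epsilon explore (uniform random), otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                # exploration
    return max(actions, key=lambda a: q_values[a])   # exploitation

# A decaying schedule, e.g. epsilon = max(0.01, 0.5 * 0.99**episode), is one
# simple way to reduce epsilon over time.
```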


In Figure 1.8 [10], the authors compare the methods for two values of ε. We can see that, at the beginning, the bigger ε gives a very good result compared to the small one, but the gap shrinks as knowledge of the environment increases. To address this issue, one may reduce the ε value during the learning process.

Figure 1.8: Average performance of ε-greedy action-value methods [10]

The ε-greedy method is much better than the greedy method. However, the exploring action is taken randomly, which means a very bad candidate action could be taken. Therefore, [25] proposed another method, softmax, in which actions are ranked based on their estimated values and the choosing phase selects an action according to its rank. For example, the following probability can be used to choose an action:

$$P(a) = \frac{e^{Q(s,a)/\tau}}{\sum_{b=1}^{n} e^{Q(s,b)/\tau}} \qquad (1.2)$$
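A minimal sketch of this softmax (Boltzmann) selection, assuming a temperature parameter tau that controls how strongly higher-valued actions are preferred:

```python
import math
import random

def softmax_action(q_values, actions, tau=1.0):
    """Choose an action with probability proportional to exp(Q(s, a) / tau)."""
    prefs = [math.exp(q_values[a] / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]
```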

There are also many other strategies for selecting the best action from the candidate set: [30] uses a greedy rule for the known part of the environment while using a genetic algorithm to determine the best action in the unknown parts; [31, 34] choose actions that lead the agent to the frontier of a local region, etc. Some, like [35], execute the exploration and exploitation processes simultaneously and combine the effects of both to control action choice.


References

1. A. Jasra, C. C. Holmes and D. A. Stephens. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, Vol. 20, No. 1, pp. 50-67, Feb. 2005.
2. Jesse Hoey, Pascal Poupart, Jennifer Boger, Craig Boutilier, Geoff Fernie and Alex Mihailidis. A decision-theoretic approach to task assistance for persons with dementia. Proc. of the 19th International Joint Conference on Artificial Intelligence, 2005.
3. Craig Boutilier, Relu Patrascu, Pascal Poupart and Dale Schuurmans. Regret-based utility elicitation in constraint-based decision problems. Proc. of the 19th International Joint Conference on Artificial Intelligence, 2005.
4. Dinah Rosenberg, Eilon Solan and Nicolas Vieille. Blackwell optimality in Markov decision processes with partial observation. The Annals of Statistics, Vol. 30, No. 4, 2002.
5. Omar Zia Khan, Pascal Poupart and James P. Black. Minimal sufficient explanations for factored Markov decision processes. Proc. of the 19th International Conference on Automated Planning and Scheduling, 2009.
6. Abhijit Gosavi. Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, Vol. 21, Issue 2, 2009.
7. Leslie Pack Kaelbling, Michael L. Littman and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, Vol. 4, 1996.
10. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, MIT Press, 1998.
11. Relu Patrascu, Pascal Poupart, Dale Schuurmans, Craig Boutilier and Carlos Guestrin. Greedy linear value-approximation for factored Markov decision processes. Eighteenth National Conference on Artificial Intelligence, Canada, 2002.
13. Claude F. Touzet. Neural Networks and Q-Learning for Robotics. 1999 International Joint Conference on Neural Networks, 1999.
15. Carlos Guestrin and Geoffrey Gordon. Distributed planning in hierarchical factored MDPs. Proc. of the 18th Conference on Uncertainty in Artificial Intelligence, pp. 197-206, 2002.
18. A. G. Barto, R. S. Sutton and P. S. Brouwer. Associative search network: A reinforcement learning associative memory. IEEE Transactions on Systems, Man, and Cybernetics, 40:201-211, 1981.
19. Michael Bowling, Alborz Geramifard and David Wingate. Sigma point policy iteration. Proc. of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, 2008.
20. Daphne Koller and Ronald Parr. Computing factored value functions for policies in structured MDPs. Proceedings of the 16th International Joint Conference on Artificial Intelligence, 1999.
22. Carlos Guestrin, Daphne Koller, Ronald Parr and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 2003.
24. Richard Bellman. Dynamic programming and stochastic control processes. Information and Control, 1, 228-239, 1958.
33. John J. Leonard and Hans Jacob S. Feder. A computationally efficient method for large-scale concurrent mapping and localization. Robotics Research: The Ninth International Symposium, Springer-Verlag, 2000.
34. B. Yamauchi. Frontier-based exploration using multiple robots. Proc. of the Second International Conference on Autonomous Agents, 47-53, 1998.
35. Kazunori Iwata, Nobuhiro Ito, Koichiro Yamauchi and Naohiro Ishii.
37. Joono Cheong, Wan K. Chung and Youngil Youm. Inverse kinematics of multilink flexible. IEEE Transactions on Robotics and Automation, Vol. 20, 2004.
38. Karan Singh and Eugene Fiume. Wires: A Geometric Deformation Technique. Proc. of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pp. 405-414, 1998.
