Artificial Intelligence Research and Development
ARTIFICIAL INTELLIGENCE RESEARCH AND DEVELOPMENT
Frontiers in Artificial Intelligence and Applications
FAIA covers all aspects of theoretical and applied artificial intelligence research in the form of monographs, doctoral dissertations, textbooks, handbooks and proceedings volumes. The FAIA series contains several sub-series, including "Information Modelling and Knowledge Bases" and "Knowledge-Based Intelligent Engineering Systems". It also includes the biennial ECAI (European Conference on Artificial Intelligence) proceedings volumes, and other publications sponsored by ECCAI, the European Coordinating Committee on Artificial Intelligence. An editorial panel of internationally well-known scholars is appointed to provide a high quality selection.
Series Editors:
J. Breuker, R. Dieng, N. Guarino, J.N. Kok, J. Liu, R. López de Mántaras, R. Mizoguchi, M. Musen and N. Zhong
Volume 131

Recently published in this series:

Vol. 130. K. Zieliński and T. Szmuc (Eds.), Software Engineering: Evolution and Emerging Technologies
Vol. 129. H. Fujita and M. Mejri (Eds.), New Trends in Software Methodologies, Tools and Techniques
Vol. 128. J. Zhou et al. (Eds.), Applied Public Key Infrastructure
Vol. 127. P. Ritrovato et al. (Eds.), Towards the Learning Grid
Vol. 126. J. Cruz, Constraint Reasoning for Differential Models
Vol. 125. C.-K. Looi et al. (Eds.), Artificial Intelligence in Education
Vol. 124. T. Washio et al. (Eds.), Advances in Mining Graphs, Trees and Sequences
Vol. 123. P. Buitelaar et al. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications
Vol. 122. C. Mancini, Cinematic Hypertext – Investigating a New Paradigm
Vol. 121. Y. Kiyoki et al. (Eds.), Information Modelling and Knowledge Bases XVI
Vol. 120. T.F. Gordon (Ed.), Legal Knowledge and Information Systems – JURIX 2004: The Seventeenth Annual Conference
Vol. 119. S. Nascimento, Fuzzy Clustering via Proportional Membership Model
Vol. 118. J. Barzdins and A. Caplinskas (Eds.), Databases and Information Systems – Selected Papers from the Sixth International Baltic Conference DB&IS'2004
Vol. 117. L. Castillo et al. (Eds.), Planning, Scheduling and Constraint Satisfaction: From Theory to Practice
Vol. 116. O. Corcho, A Layered Declarative Approach to Ontology Translation with Knowledge Preservation
ISSN 0922-6389
Artificial Intelligence Research and Development

Edited by
Beatriz López, Institute of Informatics and Applications, University of Girona, Spain
Joaquim Meléndez, Institute of Informatics and Applications, University of Girona, Spain
Petia Radeva, Computer Vision Center & Department of Computer Science, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain
and Jordi Vitrià, Computer Vision Center & Department of Computer Science, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain
Amsterdam • Berlin • Oxford • Tokyo • Washington, DC
© 2005 The authors
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher.
Preface
The term 'society of knowledge' has been imposed to draw society nearer to the future and as a symbol of breakthrough. From this perspective, AI has reached its maturity and it has exploded into an endless set of sub-areas, getting in touch with all other disciplines to assist situation assessment, analysis and interpretation of music, management of environmental and biological systems, planning of trains, routing of communication networks, assisting medical diagnosis or powering auctions.

The wide variety of Artificial Intelligence application areas has meant that AI researchers often become scattered in different micro-specialized conferences. There are few occasions where the AI research community joins together, while computer scientists and engineers can find a lot of interesting ideas from the cross-fertilization of results coming from all of these application areas. The Catalan Association for Artificial Intelligence (ACIA1) is aware of the benefit of this contact and with this aim it organizes an annual conference to promote synergies in the research community of its influence. This book provides a representative selection of papers resulting from this activity. The advances made by ACIA people and its influence area have been gathered in this single volume as an update of the previous volumes published in 2003 and 2004 (corresponding to series numbers 100 and 113).

The book is organized according to the different sessions in which the papers were presented at the Eighth Catalan Conference on Artificial Intelligence, held in Alguer (Italy) on October 26–28th, 2005, namely: Neural Networks, Computer Vision, Applications, Machine Learning, Reasoning, Planning and Robotics, and Multi-Agent Systems. Papers have been selected after a double blind review process in which distinguished AI researchers from all over Europe participated. Among the 77 papers received, 26 were selected as oral presentations and 25 as posters. The quality of the papers was high on average, and the selection between an oral or a poster session was based on the degree of discussion that a paper could generate more than on its quality. All of the papers collected in this volume would be of interest to any computer scientist or engineer interested in AI research.

We would like to express our sincere gratitude to all the authors and members of the scientific and organizing committees that have made this conference a success. Our special thanks also go to the plenary speakers for their effort in preparing the lectures.

Alghero, October 2005
Beatriz López (University of Girona)
Joaquim Meléndez (University of Girona)
Petia Radeva (Computer Vision Center, UAB)
Jordi Vitrià (Computer Vision Center, UAB)

1 ACIA, the Catalan Association for Artificial Intelligence, is a member of the European Coordinating Committee for Artificial Intelligence (ECCAI). http://www.acia.org
Conference Organization
CCIA 2005 was organized by the University of Girona, the Computer Vision Center, the Universitat Autònoma de Barcelona and the Associació Catalana d'Intel·ligència Artificial.

General Chairs
Beatriz López, University of Girona
Joaquim Meléndez, University of Girona
Petia Radeva, Computer Vision Center
Jordi Vitrià, Computer Vision Center
Scientific Committee
Isabel Aguiló, Universitat de les Illes Balears
Josep Aguilar, Centre National de la Recherche Scientifique
Cecilio Angulo, Technical University of Catalonia
Rene Bañares-Alcántara, University of Oxford
Ester Bernardó, Ramon Llull University
Vicent Botti, Technical University of Valencia
Jaume Casasnovas, Universitat de les Illes Balears
Jesus Cerquides, University of Barcelona
M Teresa Escrig, Universitat Jaume I
Francesc Ferri, University of Valencia
Rafael García, University of Girona
Josep M Garrell, Ramon Llull University
Héctor Geffner, Pompeu Fabra University
Elisabet Golobardes, Ramon Llull University
Antoni Grau, Technical University of Catalonia
M Angeles Lopez, Universitat Jaume I
Beatriz López, University of Girona
Maite López, University of Barcelona
Joan Martí, University of Girona
Enric Martí, Computer Vision Center
Joaquim Meléndez, University of Girona
José del R Millán, IDIAP Research Institute
Margaret Miró, Universitat de les Illes Balears
Antonio Moreno, Rovira i Virgili University
Eva Onaindia, Technical University of Valencia
Miquel Angel Piera, Universitat Autònoma de Barcelona
Filiberto Pla, Universitat Jaume I
Enric Plaza, Artificial Intelligence Research Institute
Monique Polit, University of Perpignan
Josep Puyol-Gruart, Artificial Intelligence Research Institute
Petia Radeva, Computer Vision Center
Ignasi R.Roda, University of Girona
Josep Lluís de la Rosa, University of Girona
Xari Rovira, Ramon Llull University
Mònica Sànchez, Technical University of Catalonia
Miquel Sànchez-Marrè, Technical University of Catalonia
Massimo Tistarelli, Università degli Studi di Sassari
Ricardo Toledo, Computer Vision Center
Miguel Toro, University of Sevilla
Vicenç Torra, Artificial Intelligence Research Institute
Enric Trillas, Technical University of Madrid
Magda Valls, University of Lleida
Llorenç Valverde, Universitat de les Illes Balears
Jordi Vitrià, Computer Vision Center
Additional referees
Arantza Aldea (Oxford Brookes University), Yolanda Bolea (Technical University of Catalonia), Mercedes E. Narciso Farias (Universitat Autònoma de Barcelona), Lluis Godo (Artificial Intelligence Research Institute), Luis González Abril (University of Sevilla), Felip Manyà (University of Lleida), Robert Martí (University of Girona), Pablo Noriega (Artificial Intelligence Research Institute), Raquel Ros (Artificial Intelligence Research Institute), Francisco Ruiz Vegas (Technical University of Catalonia), Aïda Valls (Rovira i Virgili University), Pere Vila (University of Girona)
Organizing Committee
Esteve del Acebo, ARLab, University of Girona
Marc Carreras, VICOROB, University of Girona
Joan Colomer, eXiT, University of Girona
Xavier Cufí, VICOROB, University of Girona
Joan Martí, VICOROB, University of Girona
Josep Lluís Marzo, BCDS, University of Girona
Ignasi Rodríguez-Roda, LEQUIA, University of Girona
Josep Lluís de la Rosa, ARLab, University of Girona
Massimo Tistarelli, Università degli Studi di Sassari
Josep Vehí, MICE, University of Girona
Web manager and Secretariat
Xavier Ortega, Montse Vila, Maria Brugué
Sponsoring Institutions
Ciutat de l'Alguer, assessorat de cultura
Contents
Beatriz López, Joaquim Meléndez, Petia Radeva and Jordi Vitrià
and Monique Polit
Learning Human-Level AI Abilities to Drive Racing Cars 33
Francisco Gallego, Faraón Llorens, Mar Pujol and Ramón Rizo
Feature Selection and Outliers Detection with Genetic Algorithms and
Agusti Solanas, Enrique Romero, Sergio Gómez, Josep M Sopena,
Rene Alquézar and Josep Domingo-Ferrer
Image Segmentation Based on Inter-Feature Distance Maps 75
Susana Álvarez, Xavier Otazu and Maria Vanrell
Staff and Graphical Primitive Segmentation in Old Handwritten Music Scores 83
Alicia Fornés, Josep Lladós and Gemma Sánchez
Real-Time Face Tracking for Context-Aware Computing 91
Bogdan Raducanu and Jordi Vitrià
Experimental Study of the Usefulness of External Face Features for Face
Classification 99 Àgata Lapedriza, David Masip and Jordi Vitrià
Angle Images Using Gabor Filters in Cardiac Tagged MRI 107
Joel Barajas, Jaume Garcia-Barnés, Francesc Carreras,
Sandra Pujadas and Petia Radeva
Anna Bosch, Xavier Muñoz, Joan Martí and Arnau Oliver
Mass Segmentation Using a Pattern Matching Approach with a Mutual
Arnau Oliver, Jordi Freixenet, Joan Martí and Marta Peracaula
Feature Selection with Non-Parametric Mutual Information for Adaboost
Learning 131 Xavier Baró and Jordi Vitrià
3 Applications
OntoMusic: From Scores to Expressive Music Performances 141
Pere Ferrera and Josep Puyol-Gruart
Knowledge Production and Integration for Diagnosis, Treatment and Prognosis
John A Bohada and David Riaño
Application of Clustering Techniques in a Network Security Testing System 157
Guiomar Corral, Elisabet Golobardes, Oriol Andreu, Isard Serra,
Elisabet Maluquer and Àngel Martínez
Rut Garí, Ricardo Galli, Llorenç Valverde and Juan Fornés
A Hybrid System Combining Self Organizing Maps with Case Based Reasoning
L.E Mujica, J Vehí and J Rodellar
Acquiring Unobtrusive Relevance Feedback Through Eye-Tracking in Ambient
Gustavo González, Beatriz López, Cecilio Angulo and Josep Lluís de la Rosa
Evolution Strategies for DS-CDMA Pseudonoise Sequence Design 189
Rosa Maria Alsina Pagès, Ester Bernadó Mansilla
and Jose Antonio Morán Moreno
Evaluation of Knowledge Bases by Means of Multi-Dimensional OWA Operators 197
Isabel Aguiló, Javier Martín, Gaspar Mayor and Jaume Suñer
Automatic Discovery of Synonyms and Lexicalizations from the Web 205
David Sánchez and Antonio Moreno
4 Machine Learning
Imbalanced Training Set Reduction and Feature Selection Through Genetic
Optimization 215
R Barandela, J.K Hernández, J.S Sánchez and F.J Ferri
A Case-Based Methodology for Feature Weighting Algorithm Recommendation 223
Héctor Núñez and Miquel Sànchez-Marrè
Comparison of Strategies Based on Evolutionary Computation for the Design
A Fornells Herrera, J Camps Dausà, E Golobardes i Ribé
and J.M Garrell i Guiu
Using Symbolic Descriptions to Explain Similarity on CBR 239
Eva Armengol and Enric Plaza
Isabela Drummond and Sandra Sandri
Multilingual Question Classification Based on Surface Text Features 255
E Bisbal, D Tomás, L Moreno, J.L Vicedo and A Suárez
5 Reasoning
On Warranted Inference in Possibilistic Defeasible Logic Programming 265
Carlos Chesñevar, Guillermo Simari, Lluís Godo and Teresa Alsinet
A Discretization Process in Accordance with a Qualitative Ordered Output 273
Francisco J Ruiz, Cecilio Angulo, Núria Agell, Xari Rovira, Mónica Sánchez
and Francesc Prats
Supervised Fuzzy Control of Dissolved Oxygen in a SBR Pilot Plant 281
M.F Teran, J Colomer, J Meléndez and J Colprim
On the Consistency of a Fuzzy C-Means Algorithm for Multisets 289
Vicenç Torra and Sadaaki Miyamoto
6 Planning and Robotics
Raquel Ros, Ramon López de Màntaras, Carles Sierra and Josep Lluís Arcos
The Use of a Reasoning Process to Solve the Almost SLAM Challenge at
M Teresa Escrig Monferrer and Juan Carlos Peris Broch
Oscar Sapena and Eva Onaindía
Bug-Based T2: A New Globally Convergent Approach to Reactive Navigation 331
Javier Antich and Alberto Ortiz
A Heuristic Technique for the Capacity Assessment of Periodic Trains 339
M Abril, M.A Salido, F Barber, L Ingolotti, A Lova and P Tormos
Development of a Webots Simulator for the Lauron IV Robot 347
Julio Pacheco and Francesc Benito
A Preliminary Study on the Relaxation of Numeric Features in Planning 355
Antonio Garrido, Eva Onaindía and Donato Hernández
Multi-Objective Multicast Routing Based on Ant Colony Optimization 363
Diego Pinto, Benjamín Barán and Ramón Fabregat
F Solano, R Fabregat, B Barán, Y Donoso and J.L Marzo
7 Multiagent Systems
Andrea Giovannucci, Juan A Rodríguez-Aguilar and Jesús Cerquides
The Agent Reputation and Trust (ART) Testbed Architecture 389
Karen K Fullam, Tomas B Klos, Guillaume Muller, Jordi Sabater-Mir,
Zvi Topol, K Suzanne Barber, Jeffrey Rosenschein and Laurent Vercouter
Estefania Argente, Vicente Julian, Soledad Valero and Vicente Botti
Modelling the Human Values Scale in Recommender Systems: A First Approach 405
Javier Guzmán, Gustavo González, Josep L de la Rosa and José A Castán
Solving Ceramic Tile Factory Production Programming by MAS 413
E Argente, A Giret, S Valero, P Gómez and V Julian
Integrating Information Sources for Recommender Systems 421
Silvana Aciar, Josefina López Herrera and Josep Lluis de la Rosa
OntoPathView: A Simple View Definition Language for the Collaborative
E Jimenez, R Berlanga, I Sanz, M.J Aramburu and R Danger
Paul Valckenaers received the applied mathematics engineering degree, the computer science engineering degree and the mechanical engineering Ph.D. degree, all from the Katholieke Universiteit Leuven, Belgium. He is with the mechanical engineering department, division PMA, of the Katholieke Universiteit Leuven. His main research interests are in distributed intelligent manufacturing control, multi-agent coordination and control, and design theory for emergent systems.
Paul Valckenaers is the vice-chair of the IFAC TC on manufacturing plant control. He has published extensively in the domain. He is a member of the steering committee of the IMS Network of Excellence, in which he is chairing the SIG on benchmarking of manufacturing control systems. He currently participates in the EU Growth project MPA on modular plant architecture and the EU Growth project MABE on multi-agent business environments, and is the daily coordinator of the concerted research action AgCo – funded by the K.U.Leuven research council – on agent-based coordination and control. In the recent past, Paul Valckenaers participated in the IMS project on Holonic Manufacturing Systems as a member of the technical coordination board, and was the coordinator of the IMS Working Group and the EU Esprit LTR project MASCADA on multi-agent manufacturing control.
… from the University of Genoa. Since then he has been involved as coordinator, principal investigator and task manager in various projects funded by the European Community, among them the projects IMU, VOILA, FIRST, SECOND, VAP, TIDE MOVAID and LTR VIRSBS. He has been visiting the Department of Computer Science, Trinity College Dublin, Ireland, and was a visiting scientist at Thinking Machines and at MIT, Cambridge, Massachusetts. He is currently associate professor at the Faculty of Architecture of the University of Sassari.
His main research interests cover biological and artificial vision, biometrics, robotic navigation and visuo-motor coordination. He is author of numerous scientific papers in conferences and international journals. He was the chairman of the Intl. Workshop on Advances in Facial Image Analysis and Recognition Technology and of the Intl. Workshop on "Biometric Authentication". He was associate editor for the journal Image and Vision Computing, and he is co-editor of the special issue of IEEE Transactions on Circuits and Systems for Video Technology on Image and Video Based …
1 Neural Networks
Direct Policy Search Reinforcement Learning for Robot Control
Andres El-Fakdi1, Marc Carreras and Narcís Palomeras
University of Girona, Spain
Abstract.
In this paper, we present Policy Methods as an alternative to Value Methods to solve Reinforcement Learning problems. The paper proposes a Direct Policy Search algorithm that uses a Neural Network to represent the control policies. Details about the algorithm and the update rules are given. The main application of the proposed algorithm is to implement robot control systems, in which the generalization problem usually arises. In this paper, we point out the suitability of our algorithm in a RL benchmark that was specially designed to test the generalization capability of RL algorithms. Results show that policy methods obtain better results than value methods in these situations.

Keywords. Reinforcement learning, Direct Policy Search and Robot Learning
1 Introduction
A commonly used methodology in robot learning is Reinforcement Learning (RL) [1]. In RL, an agent tries to maximize a scalar evaluation (reward or punishment) obtained as a result of its interaction with the environment. The goal of a RL system is to find an optimal policy which maps the state of the environment to an action which in turn will maximize the accumulated future rewards. Most RL techniques are based on Finite Markov Decision Processes (FMDP), assuming finite state and action spaces. The main advantage of RL is that it does not use any knowledge database, so the learner is not told what to do, as occurs in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. Therefore, this class of learning is suitable for online robot learning. The main disadvantages are a long convergence time and the lack of generalization among continuous variables.
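In symbols, and assuming the standard discounted formulation (the discount factor γ below is our notation; it is not introduced at this point of the paper), the learning goal can be written as finding the policy

$$\pi^{*} \;=\; \arg\max_{\pi}\; E_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1}\right], \qquad 0 \le \gamma \le 1 .$$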
The dominant approach for solving the RL problem has been the use of a value function but, although it has demonstrated to work well in many applications, it has several limitations. If the state-space is not completely observable (POMDP), small changes in the estimated value of an action cause it to be, or not be, selected, and this will result in convergence problems [2]. Over the past few years, studies have shown that approximating directly a policy can be easier than working with value functions, and better results can be obtained [3,4]. Instead of approximating a value function, new methodologies approximate a policy using an independent continuous function approximator with its own parameters, trying to maximize the expected reward. Examples of direct policy methods are the REINFORCE algorithm [5], the direct-gradient algorithm [6] and certain variants of the actor-critic framework [7]. The advantages of policy methods over value-function based methods are various. A problem for which the policy is easier to represent should be solved using policy algorithms [4]. Working this way should represent a decrease in the computational complexity and, for learning control systems which operate in the physical world, the reduction in computation time would be significant. Furthermore, learning systems should be designed to explicitly account for the resulting violations of the Markov property. Studies have shown that stochastic policy-only methods can obtain better results when working in POMDP than those obtained with deterministic value-function methods [8]. On the other side, policy methods usually learn much more slowly than RL algorithms using a value function [3] and they typically find only local optima of the expected reward [9].

1 Correspondence to: Andres El-Fakdi, Edifici PIV, Campus Montilivi, Universitat de Girona, 17071 Girona, Spain. Tel.: +34 972 419 871; Fax: +34 972 418 259; E-mail: aelfakdi@eia.udg.es.
We propose the use of an online Direct Policy Search (DPS) algorithm, based on Baxter and Bartlett's direct-gradient algorithm OLPOMDP [10], for its application in the control system of a real system, such as a robot. This algorithm has the goal of learning a state/action mapping that will be applied in the control system. The policy is represented by a neural network whose input is a representation of the state, whose output is action selection probabilities, and whose weights are the policy parameters. The proposed method is based on a stochastic gradient descent with respect to the policy parameter space; it does not need a model of the environment to be given and it is incremental, requiring only a constant amount of computation per step. The objective of the agent is to compute a stochastic policy [8], which assigns a probability over each action.

The work presented in this paper is the continuation of a research line about robot learning using RL, in which a more conventional value-function algorithm was first investigated [11,12]. The robot task used to test the algorithm was the learning of a target following behavior with an underwater robot. This robot task has already been tested in a simulation environment, obtaining very satisfactory results [13]. In this paper, we describe in detail our DPS algorithm and show its efficiency in a RL benchmark, the "mountain-car" task, to show the high generalization capability of policy methods.
2 The DPS algorithm
A partially observable Markov decision process (POMDP) consists of a state space S, an observation space Y and a control space U. For each state i ∈ S there is a deterministic reward r(i). As mentioned before, the algorithm is designed to work on-line: at every time step the learner (our robot) will be given an observation of the state and, according to the policy followed at that moment, it will generate a control action. As a result, the learner will be driven to another state and will receive a reward associated to this new state. This reward will allow us to update the controller's parameters that define the policy followed at every iteration, resulting in a final policy considered to be optimal or close to optimal. The algorithm procedure is summarized in Table 1. The schema of the ANN used to implement the control policy can be seen in Figure 1.
The algorithm works as follows: having initialized the parameters vector θ_0, the initial state i_0 and the gradient z_0 = 0, the learning procedure is iterated T times.
Table 1. Algorithm: Baxter & Bartlett's OLPOMDP
Given T > 0, initial parameters θ_0, initial state i_0, β ∈ [0, 1) and z_0 = 0, for t = 0, ..., T − 1:
1. Observe y_t, the observation of the current state.
2. Draw a control action u_t according to the current policy Pr(·|θ_t, y_t).
3. Observe the reward r(i_{t+1}) associated with the new state.
4. Set z_{t+1} = β z_t + ∇Pr_{u_t}(θ_t, y_t) / Pr_{u_t}(θ_t, y_t).
5. Set θ_{t+1} = θ_t + γ r(i_{t+1}) z_{t+1}.
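As an illustration only, the per-step update of Table 1 can be sketched in code as below; the names env, policy and grad_log_prob are placeholders of ours for the problem-specific parts (an environment with a reset()/step() interface, the ANN forward pass, and the gradient of the selected action's probability), not code from the paper.

```python
import numpy as np

def olpomdp(theta, T, beta, gamma, env, policy, grad_log_prob):
    """Sketch of the OLPOMDP update of Table 1 (our variable names)."""
    z = np.zeros_like(theta)                 # eligibility trace z_0 = 0
    y = env.reset()                          # initial observation
    for t in range(T):
        pr = policy(theta, y)                # soft-max action probabilities
        a = np.random.choice(len(pr), p=pr)  # draw a control action
        g = grad_log_prob(theta, y, a)       # ratio grad(Pr_a) / Pr_a
        y, r, done = env.step(a)             # new observation and reward r(i_{t+1})
        z = beta * z + g                     # z_{t+1} = beta*z_t + grad/Pr
        theta = theta + gamma * r * z        # theta_{t+1} = theta_t + gamma*r*z_{t+1}
        if done:
            y = env.reset()                  # start a new episode when necessary
    return theta
```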
In order to clarify the steps taken, the next lines will relate the update parameterprocedure of the algorithm closely The controller uses a neural network as a functionapproximator that generates a stochastic policy Its weights are the policy parameters thatare updated on-line every time step The accuracy of the approximation is controlled bythe parameterβ ∈ [0, 1).
The first step in the weight update procedure is to compute the ratio

∇_θ Pr_{a_t}(θ_t, y_t) / Pr_{a_t}(θ_t, y_t)        (2)

between the gradient of the probability of the selected action a_t and the probability itself.
Figure 1 Schema of the ANN architecture used.
In order to compute these gradients, we evaluate the soft-max distribution for each possible future state, exponentiating the real-valued ANN outputs {o_1, ..., o_n}, being n the number of neurons of the output layer [14].
After applying the soft-max function, the outputs of the neural network give a weighting, ξ_j ∈ (0, 1), to each of the vehicle's thrust combinations. Finally, the probability of the i-th thrust combination is then given by:

Pr_i = e^{o_i} / Σ_{j=1}^{n} e^{o_j}

The gradient of the probability of the chosen action is computed with a procedure similar to error back propagation [15]. Before computing the gradient, the error on the neurons of the output layer must be calculated. This error is given by the next expression:
e_j = d_j − Pr_j
The desired output d_j will be equal to 1 if the action selected was o_j and 0 otherwise (see Figure 2). With the soft-max output error calculation completed, the next phase consists in computing the gradient at the output of the ANN and back propagating it to the rest of the neurons of the hidden layers. For a neuron j located in the output layer we may express the local gradient as:

δ_j = e_j ϕ_j(o_j)

where e_j is the soft-max error at the output of neuron j, ϕ_j(o_j) corresponds to the derivative of the activation function associated with that neuron, and o_j is the function signal at the output for that neuron. So we do not back propagate the gradient of an error measure, but instead we back propagate the soft-max gradient of this error. Therefore, for a neuron j located in a hidden layer the local gradient is defined as follows:
δ_j = ϕ_j(o_j) Σ_k δ_k w_kj

where the sum runs over the neurons k fed by neuron j and w_kj is the weight connecting them. Having all local gradients of all neurons calculated, the expression in Equation 2 can be obtained and, finally, the old parameters are updated following the expression:

θ_{t+1} = θ_t + γ r(i_{t+1}) z_{t+1}

The vector of parameters θ_t represents the network weights to be updated, r(i_{t+1}) is the reward given to the learner at every time step, z_{t+1} describes the estimated gradients mentioned before and γ is the learning rate of the DPS algorithm.
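To make the back-propagation of the soft-max gradient concrete, the sketch below (our own naming, for a one-hidden-layer tanh network with linear outputs, as used later in Section 3.3) samples an action and returns ∇ln Pr_a = ∇Pr_a / Pr_a with respect to all weights, which is exactly the quantity accumulated in z_{t+1}:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())                  # exponentiate the real-valued ANN outputs
    return e / e.sum()

def policy_and_grad(params, y, rng):
    """Forward pass, action sampling and back-propagation of the soft-max error."""
    h = np.tanh(params["W1"] @ y + params["b1"])      # tanh hidden layer
    o = params["W2"] @ h + params["b2"]               # linear outputs o_1..o_n
    pr = softmax(o)                                   # probabilities Pr_1..Pr_n
    a = rng.choice(len(pr), p=pr)                     # action drawn from the distribution

    d = np.zeros_like(pr); d[a] = 1.0                 # desired output for the selected action
    e = d - pr                                        # soft-max error e_j = d_j - Pr_j
    delta_out = e                                     # linear output units: derivative = 1
    delta_hid = (1.0 - h**2) * (params["W2"].T @ delta_out)  # tanh'(x) = 1 - tanh(x)^2

    grads = {"W2": np.outer(delta_out, h), "b2": delta_out,
             "W1": np.outer(delta_hid, y), "b1": delta_hid}
    return a, grads                                   # grads = grad(ln Pr_a) = grad(Pr_a)/Pr_a
```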
3 Experimental Results
3.1 The "mountain-car" task.
The "mountain-car" benchmark [16] was designed to evaluate the generalization bility of RL algorithms In this problem, a car has to reach the top of a hill, see Figure 3.However, the car is not powerful enough to drive straight to the goal Instead, it mustfirst reverse up the opposite slope in order to accelerate, acquiring enough momentum toreach the goal The states of the environment are two continuous variables, the position
capa-p and the velocity v of the car The action a is the force of the car, which can be capa-positive
and negative The reward is -1 everywhere except at the top of the hill, where it is 1.The dynamics of the system can be found in [16] The episodes in the mountain-car taskstart in a random position and velocity, and they run for a maximum of 200 iterations
or until the goal has been reached The optimal state/action mapping is not trivial sincedepending on the position and the velocity, the action has to be positive or negative
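A minimal environment with this interface might look as follows; the reward, the 200-iteration limit and the random starts follow the description above, while the dynamics constants are the commonly used ones (the paper defers the exact equations to [16]), so they should be taken as an assumption:

```python
import numpy as np

class MountainCar:
    """Minimal mountain-car task: state (p, v), action a = force."""

    def reset(self):
        # episodes start in a random position and velocity (ranges assumed)
        self.p = np.random.uniform(-1.2, 0.5)
        self.v = np.random.uniform(-0.07, 0.07)
        self.steps = 0
        return np.array([self.p, self.v])

    def step(self, a):
        # assumed standard dynamics (see [16]); gravity term follows the hill slope
        self.v = np.clip(self.v + 0.001 * a - 0.0025 * np.cos(3 * self.p), -0.07, 0.07)
        self.p = np.clip(self.p + self.v, -1.2, 0.6)
        if self.p <= -1.2:
            self.v = 0.0                       # stop at the left wall
        self.steps += 1
        at_goal = self.p >= 0.5
        reward = 1.0 if at_goal else -1.0      # -1 everywhere except at the top of the hill
        done = at_goal or self.steps >= 200    # at most 200 iterations per episode
        return np.array([self.p, self.v]), reward, done
```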
3.2 Results with a value-function algorithm
To provide a performance baseline, the classic Q_learning algorithm, which is based on a value function, was applied. The state space was finely discretized, with 180 states for the position and 150 for the velocity. The action space contained only three values: -1, 0 and 1. Therefore, the size of the Q table was 81000 cells. The exploration strategy was an ε-greedy policy with ε set at 30%. The discount factor was γ = 0.95 and
Trang 29-1.2 -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 -1
-0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8
position (p)
a = sin(p)
Figure 3 The "mountain-car" task domain.
the learning rate α = 0.5, which were found experimentally. The Q table was randomly generated at the beginning of each experiment. In each experiment, a learning phase and an evaluation phase were repeatedly executed. In the learning phase, a certain number of iterations were executed, starting new episodes when it was necessary. In the evaluation phase, 500 episodes were executed. The effectiveness of learning was evaluated by looking at the averaged number of iterations needed to finish the episode. After running 100 experiments with discrete Q_learning, the average number of iterations when the optimal policy had been learnt was 50, with 1.3 standard deviation. And the number of learning iterations needed to learn this optimal policy was 1×10^7. Figure 4a shows the effectiveness evolution of the Q_learning algorithm as a function of the learning iterations. It is interesting to compare this mark with other state/action policies. If a forward action (a = 1) is always applied, the average episode length is 86. If a random action is used, the average is 110. These averages depend highly on the fact that the maximum number of iterations in an episode is 200, since in a lot of episodes these policies do not fulfill the goal.
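A tabular Q_learning baseline along these lines could be sketched as follows, reusing the MountainCar class from the previous listing; the discretization of the position and velocity ranges is our assumption, while the remaining constants (180×150 states, three actions, ε = 0.3, γ = 0.95, α = 0.5, randomly generated Q table) are those reported in the text:

```python
import numpy as np

N_P, N_V, N_A = 180, 150, 3                      # 180 x 150 states, 3 actions -> 81000 cells
ACTIONS = [-1.0, 0.0, 1.0]
EPS, GAMMA, ALPHA = 0.30, 0.95, 0.5

def discretize(obs):
    p, v = obs
    ip = min(int((p + 1.2) / 1.8 * N_P), N_P - 1)    # position range assumed [-1.2, 0.6]
    iv = min(int((v + 0.07) / 0.14 * N_V), N_V - 1)  # velocity range assumed [-0.07, 0.07]
    return ip, iv

Q = np.random.uniform(-1, 1, size=(N_P, N_V, N_A))   # randomly generated Q table
env = MountainCar()
obs = env.reset()
for t in range(10_000_000):                      # about 1e7 learning iterations were needed
    s = discretize(obs)
    a = np.random.randint(N_A) if np.random.rand() < EPS else int(np.argmax(Q[s]))
    obs, r, done = env.step(ACTIONS[a])
    s2 = discretize(obs)
    target = r if done else r + GAMMA * np.max(Q[s2])
    Q[s + (a,)] += ALPHA * (target - Q[s + (a,)])
    if done:
        obs = env.reset()
```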
Figure 4 a) Effectiveness of the Q_learning algorithm with respect to the learning iterations After converging, the effectiveness was maximum, requiring only 50 iterations to accomplish the goal b) Effectiveness of the DPS algorithm with respect to the learning iterations The convergence time was much smaller, while a similar effectiveness (52 iterations) was achieved.
Figure 5 The hyperbolic tangent function.
3.3 Results with the DPS algorithm
A one-hidden-layer neural-network with 2 input nodes, 10 hidden nodes and 2 output nodes has been used to generate a stochastic policy. One of the inputs corresponds to the vehicle's position, the other one represents the vehicle's velocity. Each hidden and output layer has the usual additional bias term. The activation function used for the neurons of the hidden layer is the hyperbolic tangent type, see Equation 8 and Figure 5, while the output layer nodes are linear. The two output neurons have been exponentiated and normalized as explained in Section 2 to produce a probability distribution. Control actions are selected at random from this distribution.

tanh(z) = sinh(z) / cosh(z) = (e^z − e^{−z}) / (e^z + e^{−z})        (8)

In each experiment, a learning phase and an evaluation phase were repeatedly executed. In the learning phase, 500 iterations were executed, starting new episodes when it was necessary. In the evaluation phase, 200 episodes were executed. The effectiveness of learning was evaluated by looking at the averaged number of iterations needed to finish the episode. After running 100 experiments with the DPS algorithm, the average number of iterations when the optimal policy had been learnt was 52.5. And the number of learning iterations needed to learn this optimal policy was 40,000. Figure 4b shows the effectiveness evolution of the DPS algorithm as a function of the learning iterations.
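Putting the previous sketches together, the experiment of this section could be reproduced roughly as follows; the 2-10-2 architecture and the soft-max action selection follow the text, whereas β, γ, the weight initialization and the mapping of the two outputs to a negative and a positive force are assumptions of ours:

```python
import numpy as np

env = MountainCar()                          # environment sketch from Section 3.1
rng = np.random.default_rng(0)
beta, gamma = 0.95, 0.001                    # assumed values; not reported in the paper

# 2 inputs (p, v), 10 tanh hidden units, 2 soft-max outputs; small random initial weights
params = {"W1": 0.1 * rng.standard_normal((10, 2)), "b1": np.zeros(10),
          "W2": 0.1 * rng.standard_normal((2, 10)), "b2": np.zeros(2)}
z = {k: np.zeros_like(v) for k, v in params.items()}

y = env.reset()
for t in range(40_000):                      # order of magnitude of the reported convergence
    a, grads = policy_and_grad(params, y, rng)
    y, r, done = env.step((-1.0, 1.0)[a])    # assumed mapping of the two outputs to a force
    for k in params:                         # Table 1 update: z <- beta*z + g ; theta <- theta + gamma*r*z
        z[k] = beta * z[k] + grads[k]
        params[k] = params[k] + gamma * r * z[k]
    if done:
        y = env.reset()
```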
Memory While the DPS algorithm only needed the weights of the small 2-10-2 network, Q_learning required a table of 81000 cells to obtain a similar policy.
Effectiveness The minimum iterations to goal achieved by DPS (52.5) was practically equal to that achieved by Q_learning (50).

Swiftness Although policy methods usually learn more slowly than value methods, in this case the DPS algorithm was much faster than Q_learning (which was affected by the generalization problem).
4 Conclusions and Further Work
This paper has presented Policy Methods as an alternative to Value Methods to solve Reinforcement Learning problems. The paper has proposed a Direct Policy Search algorithm based on Baxter and Bartlett's direct-gradient algorithm, with a Neural Network to represent the policies. Details about the algorithm with all the update rules were given. The main application of the proposed algorithm is to implement robot control systems, in which the generalization problem usually arises. In this paper, we have pointed out the suitability of our algorithm in a RL benchmark specially designed to test the generalization capability of RL algorithms. Results have shown the better behavior of policy methods in these situations. Future work will consist of testing the DPS algorithm with real robots.
References
[1] R. Sutton and A. Barto. Reinforcement Learning, an Introduction. MIT Press, 1998.
[2] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[3] R.S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12:1057–1063, 2000.
[4] C. Anderson. Approximating a policy can be easier than approximating a value function. Technical Report Computer Science CS-00-101, Colorado State University, 2000.
[5] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[6] J. Baxter and P.L. Bartlett. Reinforcement learning in POMDPs via direct gradient ascent. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.
[7] V.R. Konda and J.N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
[8] S.P. Singh, T. Jaakkola, and M.I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In Proceedings of the Eleventh International Conference on Machine Learning, New Jersey, USA, 1994.
[9] N. Meuleau, L. Peshkin, and K. Kim. Exploration in gradient-based reinforcement learning. Technical Report AI Memo 2001-003, MIT, 2001.
[10] J. Baxter and P.L. Bartlett. Direct gradient-based reinforcement learning I: Gradient estimation algorithms. Technical report, Australian National University, 1999.
[11] M. Carreras, P. Ridao, and A. El-Fakdi. Semi-online neural-Q-learning for real-time robot learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, USA, 2003.
[12] M. Carreras and P. Ridao. Solving a RL generalization problem with the SONQL algorithm. In Seventh Catalan Conference on Artificial Intelligence, 2004.
[13] A. El-Fakdi, M. Carreras, N. Palomeras, and P. Ridao. Autonomous underwater vehicle control using reinforcement learning policy search methods. In IEEE Conference and Exhibition Oceans'05 Europe, June 2005.
[14] D.A. Aberdeen. Policy Gradient Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Australian National University, 2003.
[15] S. Haykin. Neural Networks, a Comprehensive Foundation. Prentice Hall, 2nd edition, 1999.
[16] A.W. Moore. Variable resolution dynamic programming: Efficiently learning action maps on multivariate real-value state-spaces. In Proceedings of the Eighth International Conference on Machine Learning, 1991.
… Selection Problem
Alberto FERNÁNDEZ and Sergio GÓMEZ
Departament d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili
This paper deals with the problem of tracing out the efficient frontier for the general mean-variance model, which includes a cardinality constraint ensuring that each portfolio invests exactly in a given number of different assets, and bounding constraints limiting the amount of money to be invested in each asset. In previous works, some heuristic methods based on evolutionary algorithms […], tabu search […] and simulated annealing […] have been developed. In this paper we present a different heuristic method based on Hopfield networks, and we compare the new results to those obtained using three representative methods from […] that use genetic algorithms, tabu search and simulated annealing.

Regarding the distribution of this paper, the first section presents the formulation of the portfolio selection problem, the second section explains the dynamics of our Hopfield network, in the third section some experimental results are presented, and the fourth section finishes with some conclusions.
&RUUHVSRQGLQJ $XWKRU 6HUJLR *yPH] 'HSDUWDPHQW G¶(QJLQ\HULD ,QIRUPjWLFD L 0DWHPjWLTXHV 8QLYHUVLWDW 5RYLUD L 9LUJLOL &DPSXV 6HVFHODGHV $YLQJXGD GHOV 3DwVRV &DWDODQV ( 7DUUDJRQD 6SDLQ7HO)D[(PDLOVHUJLRJRPH]#XUYQHW
With the purpose of generalizing the standard Markowitz mean-variance model […] to include cardinality and bounding constraints, we use a model formulation that can also be found in […]. Let N be the number of different assets, μ_i be the mean return of asset i, σ_ij be the covariance between returns of assets i and j, λ ∈ [0, 1] be the risk aversion parameter, K be the desired number of different assets in the portfolio with non-null investment, and let ε_i and δ_i be respectively the lower and upper bounds for the proportion of capital to be invested in asset i, with 0 ≤ ε_i ≤ δ_i ≤ 1. Regarding the decision variables, x_i represents the proportion of capital to be invested in asset i, and z_i is 1 if asset i is included in the portfolio and 0 otherwise. The general mean-variance model for the portfolio selection problem is:
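In the notation just introduced, the general model takes the standard cardinality-constrained mean-variance form used in the works cited above (restated here for readability):

$$\min_{x,z}\;\; \lambda \sum_{i=1}^{N}\sum_{j=1}^{N}\sigma_{ij}\,x_i x_j \;-\;(1-\lambda)\sum_{i=1}^{N}\mu_i x_i$$

subject to

$$\sum_{i=1}^{N} x_i = 1,\qquad \sum_{i=1}^{N} z_i = K,\qquad \varepsilon_i z_i \le x_i \le \delta_i z_i,\qquad z_i\in\{0,1\},\quad i=1,\dots,N .$$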
This formulation of the portfolio selection problem is a mixed quadratic and integer programming problem for which efficient algorithms do not exist. It is an instance of the family of multiobjective optimization problems. Therefore, the first thing to do is to adopt a definition for the concept of optimal solution. Here we use the Pareto optimality definition, that is, any feasible solution of the problem is an optimal solution (or non-dominated solution) if there is not any other feasible solution improving one objective without making worse the other.
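In symbols, writing $f_1(x)=\sum_{i}\sum_{j}\sigma_{ij}x_i x_j$ for the risk and $f_2(x)=\sum_{i}\mu_i x_i$ for the mean return, a feasible solution $x$ is non-dominated if there is no other feasible solution $y$ with

$$f_1(y) \le f_1(x) \quad\text{and}\quad f_2(y) \ge f_2(x),$$

with at least one of the two inequalities strict.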
When K = N, ε_i = 0 and δ_i = 1, the general model defined above coincides with the standard Markowitz mean-variance model. Both models, as any other multiobjective optimization problem, have several different optimal solutions. The objective function values of all the non-dominated solutions form what is called the efficient frontier. For the standard Markowitz model, the efficient frontier is an increasing curve that gives the best trade-off between mean return and variance.
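A Hopfield network minimizes an energy function which, in its usual form (our restatement, using the symbols defined next), reads

$$E(x) \;=\; -\tfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\, x_i x_j \;-\; \sum_{i=1}^{N} b_i\, x_i ,$$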
where each one of the N variables x_i in the problem is represented by a neuron in the network, b_i is the constant external input (bias) for neuron i, and w_ij is the weight of the synaptic connection from neuron i to neuron j. Comparing the energy function with the objective function in Eq. …
B López et al (Eds.)
Artificial Intelligence Research and Development< /small>
IOS Press, 2005
©...
IRUWKH,QWOZRUNVKRSRQ³%LRPHWULF$XWKHQWLFDWLRQ´+HZDVDVVRFLDWHHGLWRUIRUWKHMRXUQDO,PDJHDQG9LVLRQ&RPSXWLQJKHLVFRHGLWRUIRUWKHVSHFLDOLVVXHRI,(((7UDQVDFWLRQVRQ&LUFXLWVDQG6\VWHPVIRU9LGHR7HFKQRORJ\RQ,PDJHDQG9LGHR%DVHG
B López et al (Eds.)
Artificial Intelligence Research and Development< /small>
IOS Press, 2005
©... aelfakdi@eia.udg.es.
B López et al (Eds.)
Artificial Intelligence Research and Development< /small>
IOS Press, 2005
©