
DOCUMENT INFORMATION

Title: Artificial Intelligence Research and Development
Institution: University of Girona
Field: Artificial Intelligence
Year of publication: 2005
City: Amsterdam
Number of pages: 453
File size: 12.07 MB



ARTIFICIAL INTELLIGENCE RESEARCH AND DEVELOPMENT


Frontiers in Artificial Intelligence and Applications

FAIA covers all aspects of theoretical and applied artificial intelligence research in the form of monographs, doctoral dissertations, textbooks, handbooks and proceedings volumes. The FAIA series contains several sub-series, including “Information Modelling and Knowledge Bases” and “Knowledge-Based Intelligent Engineering Systems”. It also includes the biannual ECAI, the European Conference on Artificial Intelligence, proceedings volumes, and other ECCAI – the European Coordinating Committee on Artificial Intelligence – sponsored publications. An editorial panel of internationally well-known scholars is appointed to provide a high quality selection.

Series Editors:
J. Breuker, R. Dieng, N. Guarino, J.N. Kok, J. Liu, R. López de Mántaras, R. Mizoguchi, M. Musen and N. Zhong

Volume 131

Recently published in this series:

Vol. 130. K. Zieliński and T. Szmuc (Eds.), Software Engineering: Evolution and Emerging Technologies
Vol. 129. H. Fujita and M. Mejri (Eds.), New Trends in Software Methodologies, Tools and Techniques
Vol. 128. J. Zhou et al. (Eds.), Applied Public Key Infrastructure
Vol. 127. P. Ritrovato et al. (Eds.), Towards the Learning Grid
Vol. 126. J. Cruz, Constraint Reasoning for Differential Models
Vol. 125. C.-K. Looi et al. (Eds.), Artificial Intelligence in Education
Vol. 124. T. Washio et al. (Eds.), Advances in Mining Graphs, Trees and Sequences
Vol. 123. P. Buitelaar et al. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications
Vol. 122. C. Mancini, Cinematic Hypertext – Investigating a New Paradigm
Vol. 121. Y. Kiyoki et al. (Eds.), Information Modelling and Knowledge Bases XVI
Vol. 120. T.F. Gordon (Ed.), Legal Knowledge and Information Systems – JURIX 2004: The Seventeenth Annual Conference
Vol. 119. S. Nascimento, Fuzzy Clustering via Proportional Membership Model
Vol. 118. J. Barzdins and A. Caplinskas (Eds.), Databases and Information Systems – Selected Papers from the Sixth International Baltic Conference DB&IS’2004
Vol. 117. L. Castillo et al. (Eds.), Planning, Scheduling and Constraint Satisfaction: From Theory to Practice
Vol. 116. O. Corcho, A Layered Declarative Approach to Ontology Translation with Knowledge Preservation

ISSN 0922-6389


Artificial Intelligence Research and Development

Edited by

Beatriz López
Institute of Informatics and Applications, University of Girona, Spain

Joaquim Meléndez
Institute of Informatics and Applications, University of Girona, Spain

Petia Radeva
Computer Vision Center & Department of Computer Science, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain

and

Jordi Vitrià
Computer Vision Center & Department of Computer Science, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain

Amsterdam • Berlin • Oxford • Tokyo • Washington, DC


© 2005 The authors

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher.

Distributor in the UK and Ireland Distributor in the USA and Canada


Preface

… ‘society of knowledge’ has been imposed to draw society nearer to the future and a symbol of breakthrough. From this perspective, AI has reached its maturity and it has exploded into an endless set of sub-areas, getting in touch with all other disciplines to assist situation assessment, analysis and interpretation of music, management of environmental and biological systems, planning of trains, routing of communication networks, assisting medical diagnosis or powering auctions.

The wide variety of Artificial Intelligence application areas has meant that AI researchers often become scattered in different micro-specialized conferences. There are few occasions where the AI research community joins together, while computer scientists and engineers can find a lot of interesting ideas in the cross-fertilization of results coming from all of these application areas. The Catalan Association for Artificial Intelligence (ACIA1) is aware of the benefit of this contact and, with this aim, it organizes an annual conference to promote synergies in the research community of its influence. This book provides a representative selection of the papers resulting from this activity. The advances made by ACIA people and its influence area have been gathered in this single volume as an update of previous volumes published in 2003 and 2004 (corresponding to series numbers 100 and 113). The book is organized according to the different sessions in which the papers were presented at the Eighth Catalan Conference on Artificial Intelligence, held in Alguer (Italy) on October 26–28th, 2005, namely: Neural Networks, Computer Vision, Applications, Machine Learning, Reasoning, Planning and Robotics, and Multi-Agent Systems. Papers have been selected after a double-blind review process in which distinguished AI researchers from all over Europe participated. Among the 77 papers received, 26 were selected as oral presentations and 25 as posters. The quality of the papers was high on average, and the selection between an oral or a poster session was based on the degree of discussion that a paper could generate more than on its quality. All of the papers collected in this volume would be of interest to any computer scientist or engineer interested in AI research.

We would like to express our sincere gratitude to all the authors and members of the scientific and organizing committees that have made this conference a success. Our special thanks also to the plenary speakers for their effort in preparing the lectures.

Alghero, October 2005

Beatriz López (University of Girona)
Joaquim Meléndez (University of Girona)
Petia Radeva (Computer Vision Center, UAB)
Jordi Vitrià (Computer Vision Center, UAB)

1 ACIA, the Catalan Association for Artificial Intelligence, is a member of the European Coordinating Committee for Artificial Intelligence (ECCAI). http://www.acia.org


Conference Organization

CCIA 2005 was organized by the University of Girona, the Computer Vision Center, the Universitat Autònoma de Barcelona and the Associació Catalana d'Intel·ligència Artificial.

General Chairs

Beatriz López, University of Girona

Joaquim Meléndez, University of Girona

Petia Radeva, Computer Vision Center

Jordi Vitrià, Computer Vision Center

Scientific Committee

Isabel Aguiló, Universitat de les Illes Balears

Josep Aguilar, Centre National de la Recherche Scientifique

Cecilio Angulo, Technical University of Catalonia

Rene Bañares-Alcántara, University of Oxford

Ester Bernardó, Ramon Llull University

Vicent Botti, Technical University of Valencia

Jaume Casasnovas, Universitat de les Illes Balears

Jesus Cerquides, University of Barcelona

M Teresa Escrig, Universitat Jaume I

Francesc Ferri, University of Valencia

Rafael García, University of Girona

Josep M Garrell, Ramon Llull University

Héctor Geffner, Pompeu Fabra University

Elisabet Golobardes, Ramon Llull University

Antoni Grau, Technical University of Catalonia

M Angeles Lopez, Universitat Jaume I

Beatriz López, University of Girona

Maite López, University of Barcelona

Joan Martí, University of Girona

Enric Martí, Computer Vision Center

Joaquim Meléndez, University of Girona

José del R Millán, IDIAP Research Institute

Margaret Miró, Universitat de les Illes Balears

Antonio Moreno, Rovira i Virgili University

Eva Onaindia, Technical University of Valencia

Miquel Angel Piera, Universitat Autònoma de Barcelona

Filiberto Pla, Universitat Jaume I

Enric Plaza, Artificial Intelligence Research Institute

Monique Polit, University of Perpignan

Josep Puyol-Gruart, Artificial Intelligence Research Institute

Petia Radeva, Computer Vision Center

Ignasi R. Roda, University of Girona

Josep Lluís de la Rosa, University of Girona

Xari Rovira, Ramon Llull University


Mònica Sànchez, Technical University of Catalonia

Miquel Sànchez-Marrè, Technical University of Catalonia

Massimo Tistarelli, Università degli Studi di Sassari

Ricardo Toledo, Computer Vision Center

Miguel Toro, University of Sevilla

Vicenç Torra, Artificial Intelligence Research Institute

Enric Trillas, Technical University of Madrid

Magda Valls, University of Lleida

Llorenç Valverde, Universitat de les Illes Balears

Jordi Vitrià, Computer Vision Center

Additional referees

Arantza Aldea (Oxford Brookes University), Yolanda Bolea (Technical University of Catalonia), Mercedes E. Narciso Farias (Universitat Autònoma de Barcelona), Lluis Godo (Artificial Intelligence Research Institute), Luis González Abril (University of Sevilla), Felip Manyà (University of Lleida), Robert Martí (University of Girona), Pablo Noriega (Artificial Intelligence Research Institute), Raquel Ros (Artificial Intelligence Research Institute), Francisco Ruiz Vegas (Technical University of Catalonia), Aïda Valls (Rovira i Virgili University), Pere Vila (University of Girona)

Organizing Committee

Esteve del Acebo, ARLab, University of Girona

Marc Carreras, VICOROB, University of Girona

Joan Colomer, eXiT, University of Girona

Xavier Cufí, VICOROB, University of Girona

Joan Martí, VICOROB, University of Girona

Josep Lluís Marzo, BCDS, University of Girona

Ignasi Rodríguez-Roda, LEQUIA, University of Girona

Josep Lluís de la Rosa, ARLab, University of Girona

Massimo Tistarelli, Università degli Studi di Sassari

Josep Vehí, MICE, University of Girona

Web manager and Secretariat

Xavier Ortega, Montse Vila, Maria Brugué

Sponsoring Institutions

Ciutat de l'Alguer, assessorat de cultura


Contents

Beatriz López, Joaquim Meléndez, Petia Radeva and Jordi Vitrià

and Monique Polit

Learning Human-Level AI Abilities to Drive Racing Cars 33
Francisco Gallego, Faraón Llorens, Mar Pujol and Ramón Rizo

Feature Selection and Outliers Detection with Genetic Algorithms and

Agusti Solanas, Enrique Romero, Sergio Gómez, Josep M Sopena,

Rene Alquézar and Josep Domingo-Ferrer


Image Segmentation Based on Inter-Feature Distance Maps 75

Susana Álvarez, Xavier Otazu and Maria Vanrell

Staff and Graphical Primitive Segmentation in Old Handwritten Music Scores 83

Alicia Fornés, Josep Lladós and Gemma Sánchez

Real-Time Face Tracking for Context-Aware Computing 91

Bogdan Raducanu and Jordi Vitrià

Experimental Study of the Usefulness of External Face Features for Face

Classification 99
Àgata Lapedriza, David Masip and Jordi Vitrià

Angle Images Using Gabor Filters in Cardiac Tagged MRI 107

Joel Barajas, Jaume Garcia-Barnés, Francesc Carreras,

Sandra Pujadas and Petia Radeva

Anna Bosch, Xavier Muñoz, Joan Martí and Arnau Oliver

Mass Segmentation Using a Pattern Matching Approach with a Mutual

Arnau Oliver, Jordi Freixenet, Joan Martí and Marta Peracaula

Feature Selection with Non-Parametric Mutual Information for Adaboost

Learning 131
Xavier Baró and Jordi Vitrià

3 Applications

OntoMusic: From Scores to Expressive Music Performances 141

Pere Ferrera and Josep Puyol-Gruart

Knowledge Production and Integration for Diagnosis, Treatment and Prognosis

John A Bohada and David Riaño

Application of Clustering Techniques in a Network Security Testing System 157

Guiomar Corral, Elisabet Golobardes, Oriol Andreu, Isard Serra,

Elisabet Maluquer and Àngel Martínez

Rut Garí, Ricardo Galli, Llorenç Valverde and Juan Fornés

A Hybrid System Combining Self Organizing Maps with Case Based Reasoning

L.E Mujica, J Vehí and J Rodellar

Acquiring Unobtrusive Relevance Feedback Through Eye-Tracking in Ambient

Gustavo González, Beatriz López, Cecilio Angulo and Josep Lluís de la Rosa


Evolution Strategies for DS-CDMA Pseudonoise Sequence Design 189

Rosa Maria Alsina Pagès, Ester Bernadó Mansilla

and Jose Antonio Morán Moreno

Evaluation of Knowledge Bases by Means of Multi-Dimensional OWA Operators 197

Isabel Aguiló, Javier Martín, Gaspar Mayor and Jaume Suñer

Automatic Discovery of Synonyms and Lexicalizations from the Web 205

David Sánchez and Antonio Moreno

4 Machine Learning

Imbalanced Training Set Reduction and Feature Selection Through Genetic

Optimization 215

R Barandela, J.K Hernández, J.S Sánchez and F.J Ferri

A Case-Based Methodology for Feature Weighting Algorithm Recommendation 223

Héctor Núñez and Miquel Sànchez-Marrè

Comparison of Strategies Based on Evolutionary Computation for the Design

A Fornells Herrera, J Camps Dausà, E Golobardes i Ribé

and J.M Garrell i Guiu

Using Symbolic Descriptions to Explain Similarity on CBR 239

Eva Armengol and Enric Plaza

Isabela Drummond and Sandra Sandri

Multilingual Question Classification Based on Surface Text Features 255

E Bisbal, D Tomás, L Moreno, J.L Vicedo and A Suárez

5 Reasoning

On Warranted Inference in Possibilistic Defeasible Logic Programming 265

Carlos Chesñevar, Guillermo Simari, Lluís Godo and Teresa Alsinet

A Discretization Process in Accordance with a Qualitative Ordered Output 273

Francisco J Ruiz, Cecilio Angulo, Núria Agell, Xari Rovira, Mónica Sánchez

and Francesc Prats

Supervised Fuzzy Control of Dissolved Oxygen in a SBR Pilot Plant 281

M.F Teran, J Colomer, J Meléndez and J Colprim

On the Consistency of a Fuzzy C-Means Algorithm for Multisets 289

Vicenç Torra and Sadaaki Miyamoto


6 Planning and Robotics

Raquel Ros, Ramon López de Màntaras, Carles Sierra and Josep Lluís Arcos

The Use of a Reasoning Process to Solve the Almost SLAM Challenge at

M Teresa Escrig Monferrer and Juan Carlos Peris Broch

Oscar Sapena and Eva Onaindía

Bug-Based T2: A New Globally Convergent Approach to Reactive Navigation 331

Javier Antich and Alberto Ortiz

A Heuristic Technique for the Capacity Assessment of Periodic Trains 339

M Abril, M.A Salido, F Barber, L Ingolotti, A Lova and P Tormos

Development of a Webots Simulator for the Lauron IV Robot 347

Julio Pacheco and Francesc Benito

A Preliminary Study on the Relaxation of Numeric Features in Planning 355

Antonio Garrido, Eva Onaindía and Donato Hernández

Multi-Objective Multicast Routing Based on Ant Colony Optimization 363

Diego Pinto, Benjamín Barán and Ramón Fabregat

F Solano, R Fabregat, B Barán, Y Donoso and J.L Marzo

7 Multiagent Systems

Andrea Giovannucci, Juan A Rodríguez-Aguilar and Jesús Cerquides

The Agent Reputation and Trust (ART) Testbed Architecture 389

Karen K Fullam, Tomas B Klos, Guillaume Muller, Jordi Sabater-Mir,

Zvi Topol, K Suzanne Barber, Jeffrey Rosenschein and Laurent Vercouter

Estefania Argente, Vicente Julian, Soledad Valero and Vicente Botti

Modelling the Human Values Scale in Recommender Systems: A First Approach 405

Javier Guzmán, Gustavo González, Josep L de la Rosa and José A Castán

Solving Ceramic Tile Factory Production Programming by MAS 413

E Argente, A Giret, S Valero, P Gómez and V Julian

Integrating Information Sources for Recommender Systems 421

Silvana Aciar, Josefina López Herrera and Josep Lluis de la Rosa


OntoPathView: A Simple View Definition Language for the Collaborative

E Jimenez, R Berlanga, I Sanz, M.J Aramburu and R Danger


Paul Valckenaers received the applied mathematics engineering degree, the computer science engineering degree and the mechanical engineering PhD degree, all from the Katholieke Universiteit Leuven, Belgium. He is with the mechanical engineering department, division PMA, of the Katholieke Universiteit Leuven. His main research interests are in distributed intelligent manufacturing control, multi-agent coordination and control, and design theory for emergent systems.

Paul Valckenaers is the vice-chair of the IFAC TC on manufacturing plant control. He has published extensively in the domain. He is a member of the steering committee of the IMS Network of Excellence, in which he is chairing the SIG on benchmarking of manufacturing control systems. He currently participates in the EU Growth project MPA (on modular plant architecture) and the EU Growth project MABE (on multi-agent business environments), and is the daily coordinator of the concerted research action AgCo – funded by the K.U. Leuven research council – on agent-based coordination and control. In the recent past, Paul Valckenaers participated in the IMS project on Holonic Manufacturing Systems as a member of the technical coordination board, and was the coordinator of the IMS Working Group and the EU Esprit LTR project MASCADA (on multi-agent manufacturing control).


… from the University of Genoa. He has since been involved as coordinator, principal investigator and task manager in various projects funded by the European Community, among them the projects IMU, VOILA, FIRST, SECOND, VAP, TIDE-MOVAID and LTR VIRSBS. He has been a visitor at the Department of Computer Science, Trinity College Dublin, Ireland, and was a visiting scientist at Thinking Machines and MIT, Cambridge, Massachusetts. He is currently associate professor at the Faculty of Architecture of the University of Sassari.

His main research interests cover biological and artificial vision, biometrics, robotic navigation and visuomotor coordination. He is the author of numerous scientific papers in conferences and international journals. He was the chairman of the Intl. Workshop on Advances in Facial Image Analysis and Recognition Technology and of the Intl. Workshop on “Biometric Authentication”. He was associate editor for the journal Image and Vision Computing, and he is co-editor of the special issue of IEEE Transactions on Circuits and Systems for Video Technology on Image and Video Based …


1 Neural Networks


Direct Policy Search Reinforcement Learning for Robot Control

Andres El-Fakdi1, Marc Carreras and Narcís Palomeras

University of Girona, Spain

Abstract. In this paper we present Policy Methods as an alternative to Value Methods to solve Reinforcement Learning problems. The paper proposes a Direct Policy Search algorithm that uses a Neural Network to represent the control policies. Details about the algorithm and the update rules are given. The main application of the proposed algorithm is to implement robot control systems, in which the generalization problem usually arises. In this paper we point out the suitability of our algorithm in an RL benchmark that was specially designed to test the generalization capability of RL algorithms. Results show that policy methods obtain better results than value methods in these situations.

Keywords. Reinforcement learning, Direct Policy Search, Robot Learning

1 Introduction

A commonly used methodology in robot learning is Reinforcement Learning (RL) [1]. In RL, an agent tries to maximize a scalar evaluation (reward or punishment) obtained as a result of its interaction with the environment. The goal of an RL system is to find an optimal policy which maps the state of the environment to an action which, in turn, will maximize the accumulated future rewards. Most RL techniques are based on Finite Markov Decision Processes (FMDP), implying finite state and action spaces. The main advantage of RL is that it does not use any knowledge database, so the learner is not told what to do, as occurs in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. Therefore, this class of learning is suitable for online robot learning. The main disadvantages are a long convergence time and the lack of generalization among continuous variables.
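The objective described here can be written compactly as maximizing the expected discounted return; this is the standard textbook formulation from [1], with generic symbols rather than this paper's notation:

$$
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_0 = s \right],
\qquad
\pi^{*} \;=\; \arg\max_{\pi} V^{\pi}(s).
$$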

The dominant approach for solving the RL problem has been the use of a value function but, although it has demonstrated to work well in many applications, it has several limitations. If the state-space is not completely observable (POMDP), small changes in the estimated value of an action cause it to be, or not be, selected, and this will result in convergence problems [2]. Over the past few years, studies have shown that approximating a policy directly can be easier than working with value functions, and better results can be obtained [3,4]. Instead of approximating a value function, new methodologies approximate a policy using an independent continuous function approximator with its own parameters, trying to maximize the expected reward. Examples of direct policy methods are the REINFORCE algorithm [5], the direct-gradient algorithm [6] and certain variants of the actor-critic framework [7]. The advantages of policy methods over value-function based methods are various. A problem for which the policy is easier to represent should be solved using policy algorithms [4]. Working this way should represent a decrease in computational complexity and, for learning control systems which operate in the physical world, the reduction in computation time would be notable. Furthermore, learning systems should be designed to explicitly account for the resulting violations of the Markov property. Studies have shown that stochastic policy-only methods can obtain better results when working in POMDPs than those obtained with deterministic value-function methods [8]. On the other hand, policy methods learn much more slowly than RL algorithms using a value function [3] and they typically find only local optima of the expected reward [9].

1 Correspondence to: Andres El-Fakdi, Edifici PIV, Campus Montilivi, Universitat de Girona, 17071 Girona, Spain. Tel.: +34 972 419 871; Fax: +34 972 418 259; E-mail: aelfakdi@eia.udg.es.

We propose the use of an online Direct Policy Search (DPS) algorithm, based on Baxter and Bartlett's direct-gradient algorithm OLPOMDP [10], for its application in the control system of a real system, such as a robot. This algorithm has the goal of learning a state/action mapping that will be applied in the control system. The policy is represented by a neural network whose input is a representation of the state, whose output is action selection probabilities, and whose weights are the policy parameters. The proposed method is based on a stochastic gradient descent with respect to the policy parameter space, it does not need a model of the environment to be given, and it is incremental, requiring only a constant amount of computation per step. The objective of the agent is to compute a stochastic policy [8], which assigns a probability over each action.

The work presented in this paper is the continuation of a research line about robot learning using RL, in which a more conventional value-function algorithm was first investigated [11,12]. The robot task used to test the algorithm was the learning of a target-following behavior with an underwater robot. This robot task has already been tested in a simulation environment, obtaining very satisfactory results [13]. In this paper, we describe in detail our DPS algorithm and show its efficiency in an RL benchmark, the "mountain-car" task, to show the high generalization capability of policy methods.

2 The DPS algorithm

A partially observable Markov decision process (POMDP) consists of a state space S, an observation space Y and a control space U. For each state i ∈ S there is a deterministic reward r(i). As mentioned before, the algorithm is designed to work on-line: at every time step the learner (our robot) will be given an observation of the state and, according to the policy followed at that moment, it will generate a control action. As a result, the learner will be driven to another state and will receive a reward associated to this new state. This reward will allow us to update the controller's parameters that define the policy followed at every iteration, resulting in a final policy considered to be optimal or close to optimal. The algorithm procedure is summarized in Table 1. The schema of the ANN used to implement the control policy can be seen in Figure 1.

The algorithm works as follows: having initialized the parameters vector θ0, the initial state i0 and the gradient z0 = 0, the learning procedure will be iterated T times.


Table 1. Algorithm: Baxter & Bartlett's OLPOMDP.
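As a companion to the algorithm summarized in Table 1, the following is a minimal sketch of an OLPOMDP-style update loop in the spirit of Baxter and Bartlett [10]; the environment interface, the policy object and all names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def olpomdp(env, policy, T, beta, gamma_lr):
    """Online policy-gradient (OLPOMDP-style) loop.

    The policy object is assumed to expose:
      init_params()            -> parameter vector theta
      probs(obs, theta)        -> action selection probabilities
      grad_log(obs, a, theta)  -> d log Pr(a|obs) / d theta  (= grad Pr / Pr)
    """
    theta = policy.init_params()
    z = np.zeros_like(theta)                       # eligibility trace, z_0 = 0
    obs = env.reset()
    for _ in range(T):
        probs = policy.probs(obs, theta)
        a = np.random.choice(len(probs), p=probs)  # sample the stochastic policy
        grad = policy.grad_log(obs, a, theta)      # gradient of log Pr(a | obs)
        obs, reward, _ = env.step(a)               # reward r(i_{t+1}) of the new state
        z = beta * z + grad                        # z_{t+1} = beta * z_t + grad Pr / Pr
        theta = theta + gamma_lr * reward * z      # theta_{t+1} = theta_t + gamma * r * z
    return theta
```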

In order to clarify the steps taken, the next lines describe the parameter update procedure of the algorithm in detail. The controller uses a neural network as a function approximator that generates a stochastic policy. Its weights are the policy parameters, which are updated on-line at every time step. The accuracy of the approximation is controlled by the parameter β ∈ [0, 1).

The first step in the weight update procedure is to compute the ratio:

∇θ Pr(ut | yt, θt) / Pr(ut | yt, θt)

that is, the gradient with respect to the network weights of the probability of the control action ut selected for the observation yt, divided by that probability.

Figure 1. Schema of the ANN architecture used.

In order to compute these gradients, we evaluate the soft-max distribution for each possible future state, exponentiating the real-valued ANN outputs {o1, ..., on}, n being the number of neurons of the output layer [14].

After applying the soft-max function, the outputs of the neural network give a weighting, ξj ∈ (0, 1), to each of the vehicle's thrust combinations. Finally, the probability of the i-th thrust combination is then given by:

Pri = ξi = exp(oi) / Σj=1..n exp(oj)

The gradient of the probability of the chosen action with respect to the network weights is computed in a way analogous to error back propagation [15]. Before computing the gradient, the error on the neurons of the output layer must be calculated. This error is given by the next expression:

ej = dj − Prj

The desired output dj will be equal to 1 if the action selected was oj, and 0 otherwise (see Figure 2). With the soft-max output error calculation completed, the next phase consists in computing the gradient at the output of the ANN and back-propagating it to the rest of the neurons of the hidden layers. For a local neuron j located in the output layer we may express the local gradient for neuron j as:

δj = ej φ'j(oj)


where ej is the soft-max error at the output of neuron j, φ'j(oj) corresponds to the derivative of the activation function associated with that neuron, and oj is the function signal at the output of that neuron. So we do not back-propagate the gradient of an error measure, but instead we back-propagate the soft-max gradient of this error. Therefore, for a neuron j located in a hidden layer, the local gradient is defined as follows:

δj = φ'j(oj) Σk δk wkj

where the sum runs over the neurons k of the layer fed by neuron j. Having all the local gradients of all neurons calculated, the expression in Equation 2 can be obtained and, finally, the old parameters are updated following the expression:

θt+1 = θt + γ r(it+1) zt+1

The vector of parameters θt represents the network weights to be updated, r(it+1) is the reward given to the learner at every time step, zt+1 describes the estimated gradients mentioned before, and γ is the learning rate of the DPS algorithm.
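A small numerical sketch of the soft-max error and the update just described; the two-output toy values, the array names and the linear output activation (so that φ'(o) = 1) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(o):
    ex = np.exp(o - o.max())               # exponentiate the real-valued outputs o_j
    return ex / ex.sum()                   # values in (0, 1), summing to 1

o = np.array([0.3, -0.1])                  # toy network outputs o_1, o_2
pr = softmax(o)                            # Pr_j: action selection probabilities
selected = 0                               # index of the action actually chosen
d = np.zeros_like(o)
d[selected] = 1.0                          # desired outputs d_j

e = d - pr                                 # soft-max error   e_j = d_j - Pr_j
delta_out = e * 1.0                        # local gradient   delta_j = e_j * phi'_j(o_j)

# delta_out is then back-propagated through the hidden layers [15]; the resulting
# gradient equals grad Pr / Pr for the selected action, is accumulated in the
# trace z, and enters the update  theta <- theta + gamma * r * z.
```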

3 Experimental Results

3.1 The "mountain-car" task.

The "mountain-car" benchmark [16] was designed to evaluate the generalization bility of RL algorithms In this problem, a car has to reach the top of a hill, see Figure 3.However, the car is not powerful enough to drive straight to the goal Instead, it mustfirst reverse up the opposite slope in order to accelerate, acquiring enough momentum toreach the goal The states of the environment are two continuous variables, the position

capa-p and the velocity v of the car The action a is the force of the car, which can be capa-positive

and negative The reward is -1 everywhere except at the top of the hill, where it is 1.The dynamics of the system can be found in [16] The episodes in the mountain-car taskstart in a random position and velocity, and they run for a maximum of 200 iterations

or until the goal has been reached The optimal state/action mapping is not trivial sincedepending on the position and the velocity, the action has to be positive or negative
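For concreteness, a commonly used form of the mountain-car dynamics is sketched below; the exact constants and equations used in [16] and in the paper's experiments may differ, so this is only an illustrative stand-in with assumed values.

```python
import numpy as np

class MountainCar:
    """Classic mountain-car dynamics with illustrative constants (not the paper's)."""

    def reset(self):
        self.p = np.random.uniform(-1.2, 0.49)   # random initial position
        self.v = np.random.uniform(-0.07, 0.07)  # random initial velocity
        return np.array([self.p, self.v])

    def step(self, a):
        # a in [-1, 1] is the force applied to the car
        self.v += 0.001 * a - 0.0025 * np.cos(3.0 * self.p)   # engine force + gravity
        self.v = float(np.clip(self.v, -0.07, 0.07))
        self.p += self.v
        if self.p < -1.2:                        # inelastic collision with the left wall
            self.p, self.v = -1.2, 0.0
        done = self.p >= 0.5                     # goal: top of the hill
        reward = 1.0 if done else -1.0           # reward as described in the text
        return np.array([self.p, self.v]), reward, done
```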

3.2 Results with a value-function algorithm

To provide a performance baseline, the classic Q_learning algorithm, which is based on a value function, was applied. The state space was finely discretized, with 180 states for the position and 150 for the velocity. The action space contained only three values: -1, 0 and 1. Therefore, the size of the Q table was 81000 cells. The exploration strategy was an ε-greedy policy with ε set at 30%.

Figure 3. The "mountain-car" task domain (horizontal axis: position p; curve labelled a = sin(p)).

The discount factor was γ = 0.95 and the learning rate α = 0.5, which were found experimentally. The Q table was randomly generated at the beginning of each experiment. In each experiment, a learning phase and an evaluation phase were repeatedly executed. In the learning phase, a certain number of iterations were executed, starting new episodes when it was necessary. In the evaluation phase, 500 episodes were executed. The effectiveness of learning was evaluated by looking at the averaged number of iterations needed to finish the episode. After running 100 experiments with discrete Q_learning, the average number of iterations when the optimal policy had been learnt was 50, with 1.3 standard deviation, and the number of learning iterations needed to learn this optimal policy was 1x10^7. Figure 4a shows the effectiveness evolution of the Q_learning algorithm with respect to the learning iterations. It is interesting to compare this mark with other state/action policies. If a forward action (a = 1) is always applied, the average episode length is 86. If a random action is used, the average is 110. These averages depend highly on the fact that the maximum number of iterations in an episode is 200, since in a lot of episodes these policies do not fulfil the goal.
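A sketch of a tabular Q_learning baseline with the settings quoted above (180 × 150 discretization, actions {-1, 0, 1}, ε = 0.3, γ = 0.95, α = 0.5); the discretization bounds and the environment interface are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

N_P, N_V = 180, 150                      # discretization of position and velocity
ACTIONS = [-1.0, 0.0, 1.0]
EPS, GAMMA, ALPHA = 0.3, 0.95, 0.5

Q = np.random.uniform(-1.0, 1.0, (N_P, N_V, len(ACTIONS)))   # randomly generated Q table

def discretize(state):
    p, v = state
    ip = int((p + 1.2) / 1.8 * N_P)      # map p in [-1.2, 0.6] to 0 .. N_P-1
    iv = int((v + 0.07) / 0.14 * N_V)    # map v in [-0.07, 0.07] to 0 .. N_V-1
    return min(max(ip, 0), N_P - 1), min(max(iv, 0), N_V - 1)

def q_learning_episode(env, max_steps=200):
    ip, iv = discretize(env.reset())
    for _ in range(max_steps):
        if np.random.rand() < EPS:                   # epsilon-greedy exploration
            a = np.random.randint(len(ACTIONS))
        else:
            a = int(Q[ip, iv].argmax())
        state, r, done = env.step(ACTIONS[a])
        ip2, iv2 = discretize(state)
        # one-step Q_learning update
        Q[ip, iv, a] += ALPHA * (r + GAMMA * Q[ip2, iv2].max() - Q[ip, iv, a])
        ip, iv = ip2, iv2
        if done:
            break
```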

Figure 4. a) Effectiveness of the Q_learning algorithm with respect to the learning iterations. After converging, the effectiveness was maximum, requiring only 50 iterations to accomplish the goal. b) Effectiveness of the DPS algorithm with respect to the learning iterations. The convergence time was much smaller, while a similar effectiveness (52 iterations) was achieved.


Figure 5. The hyperbolic tangent function.

3.3 Results with the DPS algorithm

A one-hidden-layer neural network with 2 input nodes, 10 hidden nodes and 2 output nodes has been used to generate a stochastic policy. One of the inputs corresponds to the vehicle's position, the other one represents the vehicle's velocity. Each hidden and output layer has the usual additional bias term. The activation function used for the neurons of the hidden layer is the hyperbolic tangent type, see Equation 8 and Figure 5, while the output layer nodes are linear. The two output neurons have been exponentiated and normalized, as explained in Section 2, to produce a probability distribution. Control actions are selected at random from this distribution.

tanh(z) = sinh(z) / cosh(z) = (e^z − e^−z) / (e^z + e^−z)   (8)
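A sketch of the 2–10–2 policy network just described (tanh hidden layer with bias, linear outputs, soft-max over the two output neurons, random action selection); the initialization scale and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(10, 2)); b1 = np.zeros(10)   # hidden layer (tanh)
W2 = rng.normal(scale=0.1, size=(2, 10)); b2 = np.zeros(2)    # linear output layer

def policy_probs(state):
    """state = [position, velocity]; returns the two action probabilities."""
    h = np.tanh(W1 @ np.asarray(state) + b1)   # hyperbolic tangent activations
    o = W2 @ h + b2                            # real-valued outputs o_1, o_2
    ex = np.exp(o - o.max())                   # exponentiate ...
    return ex / ex.sum()                       # ... and normalize (soft-max)

def select_action(state):
    pr = policy_probs(state)
    return int(rng.choice(len(pr), p=pr))      # actions sampled at random from Pr
```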

In each experiment, a learning phase and an evaluation phase were repeatedly executed. In the learning phase, 500 iterations were executed, starting new episodes when it was necessary. In the evaluation phase, 200 episodes were executed. The effectiveness of learning was evaluated by looking at the averaged number of iterations needed to finish the episode. After running 100 experiments with the DPS algorithm, the average number of iterations when the optimal policy had been learnt was 52.5, and the number of learning iterations needed to learn this optimal policy was 40,000. Figure 4b shows the effectiveness evolution of the DPS algorithm with respect to the learning iterations.

Generalization. In contrast with the Q table of 81000 cells, the DPS algorithm needed only the weights of a small neural network to obtain a similar policy.

Effectiveness. The minimum iterations to goal achieved by DPS (52.5) was practically equal to that achieved by Q_learning (50).

Swiftness. Although policy methods usually learn more slowly than value methods, in this case the DPS algorithm was much faster than Q_learning (which was affected by the generalization problem).


4 Conclusions and Further Work

This paper has presented Policy Methods as an alternative to Value Methods to solve Reinforcement Learning problems. The paper has proposed a Direct Policy Search algorithm based on Baxter and Bartlett's direct-gradient algorithm, with a Neural Network to represent the policies. Details about the algorithm with all the update rules were given. The main application of the proposed algorithm is to implement robot control systems, in which the generalization problem usually arises. In this paper, we have pointed out the suitability of our algorithm in an RL benchmark specially designed to test the generalization capability of RL algorithms. Results have shown that policy methods perform better in these situations. Future work will consist in testing the DPS algorithm with real robots.

References

[1] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[2] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[3] R.S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12:1057–1063, 2000.
[4] C. Anderson. Approximating a policy can be easier than approximating a value function. Technical Report Computer Science CS-00-101, Colorado State University, 2000.
[5] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[6] J. Baxter and P.L. Bartlett. Reinforcement learning in POMDPs via direct gradient ascent. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.
[7] V.R. Konda and J.N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
[8] S.P. Singh, T. Jaakkola, and M.I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In Proceedings of the Eleventh International Conference on Machine Learning, New Jersey, USA, 1994.
[9] N. Meuleau, L. Peshkin, and K. Kim. Exploration in gradient-based reinforcement learning. Technical Report AI Memo 2001-003, MIT, 2001.
[10] J. Baxter and P.L. Bartlett. Direct gradient-based reinforcement learning I: Gradient estimation algorithms. Technical report, Australian National University, 1999.
[11] M. Carreras, P. Ridao, and A. El-Fakdi. Semi-online neural-Q_learning for real-time robot learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, USA, 2003.
[12] M. Carreras and P. Ridao. Solving a RL generalization problem with the SONQL algorithm. In Seventh Catalan Conference on Artificial Intelligence, 2004.
[13] A. El-Fakdi, M. Carreras, N. Palomeras, and P. Ridao. Autonomous underwater vehicle control using reinforcement learning policy search methods. In IEEE Conference and Exhibition Oceans'05 Europe, June 2005.
[14] D.A. Aberdeen. Policy Gradient Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Australian National University, 2003.
[15] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1999.
[16] A.W. Moore. Variable resolution dynamic programming: Efficiently learning action maps on multivariate real-valued state-spaces. In Proceedings of the Eighth International Conference on Machine Learning, 1991.


… Selection Problem

Alberto FERNÁNDEZ and Sergio GÓMEZ
Departament d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili

This paper deals with the problem of tracing out the efficient frontier for the general mean-variance model, which includes a cardinality constraint, ensuring that each portfolio invests exactly in a given number of different assets, and bounding constraints, limiting the amount of money to be invested in each asset. In previous works, some heuristic methods based on evolutionary algorithms, tabu search and simulated annealing have been developed. In this paper we present a different heuristic method, based on Hopfield networks, and we compare the new results to those obtained using three representative methods from the literature that use genetic algorithms, tabu search and simulated annealing.

Regarding the distribution of this paper, the first section presents the formulation of the portfolio selection problem, the second section explains the dynamics of our Hopfield network, in the third section some experimental results are presented, and the fourth section finishes with some conclusions.

1 Corresponding Author: Sergio Gómez, Departament d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, Campus Sescelades, Avinguda dels Països Catalans, Tarragona, Spain.


With the purpose of generalizing the standard Markowitz mean-variance model to include cardinality and bounding constraints, we use a model formulation that can be found also in the literature. Let N be the number of different assets, μi be the mean return of asset i, σij be the covariance between the returns of assets i and j, λ ∈ [0, 1] be the risk aversion parameter, K be the desired number of different assets in the portfolio with non-null investment, and let εi and δi be, respectively, the lower and upper bounds for the proportion of capital to be invested in asset i, with 0 ≤ εi ≤ δi ≤ 1. Regarding the decision variables, xi represents the proportion of capital to be invested in asset i, and zi is 1 if asset i is included in the portfolio and 0 otherwise. The general mean-variance model for the portfolio selection problem is:
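In the standard cardinality-constrained form consistent with these definitions (the authors' exact typesetting and equation numbering may differ):

$$
\begin{aligned}
\min_{x,\,z}\quad & \lambda \sum_{i=1}^{N}\sum_{j=1}^{N} x_i\, x_j\, \sigma_{ij} \;-\; (1-\lambda) \sum_{i=1}^{N} x_i\, \mu_i \\
\text{s.t.}\quad & \sum_{i=1}^{N} x_i = 1, \qquad \sum_{i=1}^{N} z_i = K, \\
& \varepsilon_i\, z_i \;\le\; x_i \;\le\; \delta_i\, z_i, \qquad z_i \in \{0,1\}, \qquad i = 1,\dots,N .
\end{aligned}
$$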

This formulation of the portfolio selection problem is a mixed quadratic and integer programming problem for which efficient algorithms do not exist. It is an instance of the family of multiobjective optimization problems. Therefore, the first thing to do is to adopt a definition for the concept of optimal solution. Here we use the Pareto optimality definition, that is, any feasible solution of the problem is an optimal solution (or non-dominated solution) if there is not any other feasible solution improving one objective without making the other worse.
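Stated formally for the two objectives here (risk and mean return), a feasible solution $x$ is non-dominated if there is no feasible $x'$ such that

$$
\sum_{i,j} x'_i x'_j \sigma_{ij} \;\le\; \sum_{i,j} x_i x_j \sigma_{ij}
\qquad\text{and}\qquad
\sum_{i} x'_i \mu_i \;\ge\; \sum_{i} x_i \mu_i ,
$$

with at least one of the two inequalities strict; this is just a restatement of the definition in the preceding paragraph.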

When K = N, εi = 0 and δi = 1, the general model defined in the equations above coincides with the standard Markowitz mean-variance model. Both models, as any other multiobjective optimization problem, have several different optimal solutions. The objective function values of all non-dominated solutions form what is called the efficient frontier. For the standard Markowitz model, the efficient frontier is an increasing curve that gives the best trade-off between mean return and variance.


where each one of the N variables xi in the problem is represented by a neuron in the network, bi is the constant external input (bias) for neuron i, and wij is the weight of the synaptic connection from neuron i to neuron j. Comparing the energy function with the objective function in Eq. …
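The energy function referred to here is presumably the standard Hopfield energy, written in terms of the bi and wij just defined (a generic statement, not the paper's numbered equation):

$$
E(x) \;=\; -\tfrac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\, x_i\, x_j \;-\; \sum_{i=1}^{N} b_i\, x_i .
$$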

