Sensor-based learning for practical planning of fine motions in robotics
Enric Cervera *, Angel P. del Pobil
Department of Computer Science and Engineering, Jaume-I University, Castellón, Spain
E-mail addresses: ecervera@icc.uji.es (E. Cervera), pobil@icc.uji.es (A.P. del Pobil)
Received 4 July 2001; received in revised form 8 October 2001; accepted 28 November 2001
Abstract
This paper presents an implemented approach to part-mating of three-dimensional non-cylindrical parts with a 6 DOF manipulator, considering uncertainties in modeling, sensing and control. The core of the proposed solution is a reinforcement learning algorithm for selecting the actions that achieve the goal in the minimum number of steps. Position and force sensor values are encoded in the state of the system by means of a neural network. Experimental results are presented for the insertion of different parts – circular, quadrangular and triangular prisms – in three dimensions. The system exhibits good generalization capabilities for different shapes and locations of the assembled parts. These results significantly extend most of the previous achievements in fine motion tasks, which frequently model the robot as a polygon translating in the plane in a polygonal environment or do not present actual implemented prototypes.

© 2002 Elsevier Science Inc. All rights reserved.
Keywords: Robotics; Neural nets; Reinforcement learning
1 Introduction
We present a practical framework for fine motion tasks, particularly the insertion of non-cylindrical parts with uncertainty in modeling, sensing and control. The approach is based on an algorithm which autonomously learns a
relationship between sensed states and actions. This relationship allows the robot to select those actions which attain the goal in the minimum number of steps. A feature extraction neural network complements the learning algorithm, forming a practical sensing-action architecture for manipulation tasks.
In the type of motion planning problems addressed in this work, interactions between the robot and objects are allowed, or even mandatory, for operations such as compliant motions and parts mating. We restrict ourselves to tasks which do not require complex plans; however, they are significantly difficult to attain in practice due to uncertainties. Among these tasks, the peg-in-hole insertion problem has been broadly studied, but very few results can be found in the literature for three-dimensional non-cylindrical parts in an actual implementation.
We believe that practicality, although an important issue, has been vastly underestimated in fine motion methods, since most of these approaches are based on geometric models which become complex for non-trivial cases, especially in three dimensions [1].
The remainder of this paper is structured as follows. Section 2 reviews some related work and states the key contributions of our work. In Section 3, we describe the components of the architecture. Thorough experimental results are then presented in Section 4. Finally, Section 5 discusses a number of issues regarding the proposed approach and draws some conclusions.
2 Background and motivation
2.1 Related research
Though the peg-in-hole problem has been exhaustively studied for a long time [2–4], most of the implementations have been limited to planar motions or cylindrical parts [5–7]. Caine et al. [8] pointed out the difficulties of inserting prismatic pegs. To our knowledge, our results are the first for a system which learns to insert non-cylindrical pegs (see Fig. 1) in a real-world task with uncertainty in position and orientation.
Parts mating in real-world industry is frequently performed by passive compliance devices [4], which support parts and aid their assembly. They are capable of high-speed precision insertions, but they lack the flexibility of software methods.
A difficult issue in parts mating is the need for nonlinear compliance for chamferless insertions, which was demonstrated by Asada [2], who proposed a supervised neural network for learning the nonlinear relationship between sensing and motion in a two-dimensional frictionless peg-in-hole task. The use of a supervised network presents a great difficulty in real-world three-dimensional problems, since a proper training set has to be generated.
Lozano-Pérez [9] first proposed a formal approach to the synthesis of compliant-motion strategies from geometric descriptions of assembly operations and explicit estimates of errors in sensing and control. In an extension to this approach, Donald [10] presented a formal framework for computing motion strategies which are guaranteed to succeed in the presence of three kinds of uncertainty (sensing, control and model). Experimental verification is described in [11], but only for planar tasks. Following Donald's work, Briggs [12] proposed an O(n² log n) algorithm, where n is the number of vertices in the environment, for the basic problem of manipulating a point from a specified start region to a specified goal region amidst planar polygonal obstacles where control is subject to uncertainty. Latombe et al. [13] describe two practical methods for computing preimages for a robot having a two-dimensional Euclidean configuration space. Though the general principles of the planning methods immediately extend to higher dimensional spaces, the geometric algorithms do not, and only simulated examples of planar tasks are shown. LaValle and Hutchinson [14] present another framework for manipulation planning under uncertainty, based on preimages, though they consider such an approach to be reasonable only for a few dimensions. Their computed examples are restricted to planar polygonal models.
A different geometric approach is introduced by McCarragher and Asada [15], who define a discrete event in assembly as a change in contact state reflecting a change in a geometric constraint. The discrete event modeling is accomplished using Petri nets. Dynamic programming is used for task-level planning to determine the sequence of desired markings (contact states) for discrete event control that minimizes a path-length and uncertainty performance measure. The method is applied to a dual peg-in-hole insertion task, but the motion is kept planar.
Fig. 1. Diagram of the insertion task.

Learning methods provide a framework for autonomous adaptation and improvement during task execution. An approach to learning a reactive control strategy for peg-in-hole insertion under uncertainty and noise is presented in [16]. This approach is based on active generation of compliant behavior using a
nonlinear admittance mapping from sensed positions and forces to velocity commands. The controller learns the mapping through repeated attempts at peg insertion. A two-dimensional version of the peg-in-hole task is implemented on a real robot. The controller consists of a supervised neural network with stochastic units. In [5] the architecture is applied to a real ball-balancing task, and a three-dimensional cylindrical peg-in-hole task. Kaiser and Dillmann [17] propose a hierarchical approach to learning the efficient application of robot skills in order to solve complex tasks. Since people can carry out manipulation tasks with no apparent difficulty, they develop a method for the acquisition of sensor-based robot skills from human demonstration. Two manipulation skills are investigated: peg insertion and door opening. Distante et al. [18] apply reinforcement learning techniques to the problem of target reaching by using visual information.
If robots can be modeled as polygons moving amid polygonal obstacles in a planar world, and a detailed model is available, a geometric framework is adequate. However, since such conditions are rarely found in practice, we argue that a robust, adaptive, autonomous learning architecture for robot manipulation tasks – particularly parts mating – is a necessary alternative in real-world environments, where uncertainties in modeling, sensing and control are unavoidable.
3 A practical adaptive architecture
Fig. 2 depicts the three components of the adaptive architecture: two sensor-based motions – guarded and compliant – and an additional subsystem combining learning and exploration.

This architecture relies on two types of sensor: position (x) and force (f). Throughout this work, the position and orientation of the tool frame are obtained from the robot joint angles using the kinematic equations. Force measurements are obtained from a wrist-mounted strain gauge sensor. It is assumed that all sensors are calibrated, but uncertainty cannot be absolutely eliminated due to sensor noise and calibration imprecision. The system's output is the
end-effector velocity (v) in Cartesian coordinates, which is translated to joint coordinates by a resolved motion rate controller:

    q̇ = J⁻¹(q) v

where J is the manipulator Jacobian. Since the work space of the fine motion task is limited to a small region, the singularities of J are not important in this framework.
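To make this mapping concrete, the following sketch computes joint rates from a commanded Cartesian velocity. The Jacobian values are placeholders, and the damped pseudo-inverse is a common numerical safeguard, not a detail taken from the paper.

```python
import numpy as np

def resolved_rate(J, v, damping=1e-3):
    """Map a Cartesian end-effector velocity v to joint rates q_dot.

    A damped least-squares pseudo-inverse of the Jacobian J is used; it
    behaves like J^(-1) away from singularities, which the text above
    notes are not an issue in the small fine-motion workspace.
    """
    JT = J.T
    # q_dot = J^T (J J^T + lambda^2 I)^(-1) v
    q_dot = JT @ np.linalg.solve(J @ JT + damping**2 * np.eye(J.shape[0]), v)
    return q_dot

# Placeholder Jacobian for a 6-DOF arm (arbitrary, well-conditioned values).
J = np.eye(6) + 0.1 * np.random.default_rng(0).standard_normal((6, 6))
v = np.array([0.0, 0.0, -0.01, 0.0, 0.0, 0.0])  # slow approach along -Z
print(resolved_rate(J, v))
```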
3.1 The insertion plan
Uncertainty in the location of the part and the hole prevents the success of a simple position-based plan. Contact between parts has to be monitored, and different actions are needed to perform a correct insertion. Other approaches have tried to build a plan by considering all the possible contact states, but they have only succeeded in simple planar tasks. In addition, uncertainty poses difficulties for identifying the current state.
The proposed insertion plan consists of three steps, which are inspired by intuitive manipulation skills:

(1) Approach the hole until a contact is detected.
(2) Move compliantly around the hole until contact is lost (hole found).
(3) Move into the hole until a contact is detected (bottom of the hole).

This strategy differs from a pure random search in that an adaptation procedure is performed during the second step. The system learns a relationship between sensing and action, in an autonomous way, which guides the exploration towards the target. Initially, the system relies heavily on exploration. As a result of experience, an insertion skill is learned, and the mean insertion time for the task is considerably improved.
3.2 Guarded motions

In the above insertion plan, all the steps are force-guarded. Starting from a free state and due to the geometry of the task, a contact is gained if |Fz| rises to 0.1 kgf, and the contact is lost if |Fz| falls below 0.05 kgf. This dual threshold accounts for small variations in the contact force due to friction, or uncertainty.
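A minimal sketch of this dual-threshold (hysteresis) contact test, assuming force readings in kgf as above; the class and method names are illustrative.

```python
class ContactMonitor:
    """Hysteresis test on |Fz|: contact is gained at 0.1 kgf and only
    released below 0.05 kgf, so small friction- or noise-induced
    fluctuations do not make the contact state oscillate."""

    GAIN_THRESHOLD = 0.10  # kgf: contact considered established
    LOSS_THRESHOLD = 0.05  # kgf: contact considered lost

    def __init__(self):
        self.in_contact = False

    def update(self, fz):
        if not self.in_contact and abs(fz) >= self.GAIN_THRESHOLD:
            self.in_contact = True
        elif self.in_contact and abs(fz) < self.LOSS_THRESHOLD:
            self.in_contact = False
        return self.in_contact

monitor = ContactMonitor()
for fz in (0.02, 0.12, 0.08, 0.03):  # kgf readings over time
    print(monitor.update(fz))         # False, True, True, False
```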
3.3 Compliant motions
Once a contact is achieved, motion is restricted to a surface. In practice, two degrees of freedom (X, Y) are position-controlled, while the third one (Z) is force-controlled. Initially, random compliant motions are performed, but a relationship between sensed forces and actions is learned, which decreases the time needed to insert the part.
During the third step, a complementary compliant motion is performed. In this task, when the part is inserted, Z is position-controlled, while (X, Y) are force-controlled.
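The hybrid position/force split used in the second and third steps can be summarised as a selection over the (X, Y, Z) axes. The sketch below is only illustrative of this split, not the paper's actual controller.

```python
import numpy as np

# 1 = position-controlled axis, 0 = force-controlled axis, per plan step.
HYBRID_SELECTION = {
    "surface_exploration": (1, 1, 0),  # step 2: X, Y in position, Z in force
    "insertion":           (0, 0, 1),  # step 3: Z in position, X, Y in force
}

def hybrid_command(step, v_position, v_force):
    """Blend the position-control and force-control velocity contributions
    according to which axes the given step controls."""
    sel = np.array(HYBRID_SELECTION[step], dtype=float)
    return sel * np.asarray(v_position) + (1.0 - sel) * np.asarray(v_force)

# Exploratory step along X while a pushing force regulates motion along Z.
print(hybrid_command("surface_exploration", [0.5, 0.0, 0.0], [0.0, 0.0, -0.1]))
```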
3.4 Exploration and learning
Random search has been proposed in the literature as a valid tool for dealing with uncertainties [19]. However, the insertion time greatly increases when the clearance ratio decreases. In the proposed architecture (see Fig. 3), an adaptation process learns a relationship between sensed states and actions, which guides the insertion task towards completion with the minimum number of steps. A value is stored for each pair of state and action. This value represents the amount of reinforcement which is expected in the future, starting from the state, if the action is performed.
The reinforcement (or cost) is a scalar value which measures the quality of the performed action. In our setup, a negative constant reinforcement is generated after every motion. The learning algorithm adapts the values of the table so that the expected reinforcement is maximized, i.e., the number of actions (cost) to achieve the goal is minimized.
The discrete nature of the reinforcement learning algorithm makes it necessary to extract discrete values from the sensor signals of force and position. This feature extraction process, along with the basis of the learning algorithm, is described below.

3.4.1 Feature extraction
Force sensing is introduced to compensate for the uncertainty in positioning the end-effector. It works well when a small displacement causes a contact, since a large change in force is detected. However, with force signals alone it is not always possible to identify the actual contact state, i.e., different contacts produce similar force measurements, as described in [20].
The adopted solution is to combine the force measurements with the relative displacement of the end-effector from the initial position, i.e., that of the first contact between the part and the surface.
The next problem is the discretization of the inputs, which is a requirement of the learning algorithm. There is a conflict between size and fineness. With a fine representation, the number of states is increased, thus slowing down the convergence of the learning algorithm. Solutions are problem-dependent, using heuristics for finding a good representation of manageable size.
We have obtained good results with the division of the exploration space into three intervals along each position-controlled degree of freedom. For cylindrical parts, the XY-plane of the exploration space is divided into nine regions – a 3 × 3 grid. For non-cylindrical parts, the rotation around the Z-axis has to be considered too, thus the total number of states is 27. Region limits are fixed according to the estimated uncertainty and the radius of exploration.
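As an illustration of this partitioning, the sketch below maps the displacement from the initial contact location to one of the 9 (cylindrical) or 27 (non-cylindrical) qualitative position states. The ±2 mm limits follow the cylinder experiment reported in Section 4; the angular limit is an assumed value.

```python
def discretize_position(dx, dy, dtheta=None, limit_mm=2.0, limit_deg=2.0):
    """Split each position-controlled degree of freedom into three
    intervals: below -limit, between the limits, and above +limit."""
    def bin3(value, limit):
        return 0 if value < -limit else (2 if value > limit else 1)

    state = bin3(dx, limit_mm) * 3 + bin3(dy, limit_mm)
    if dtheta is not None:            # non-cylindrical part: add rotation
        state = state * 3 + bin3(dtheta, limit_deg)
    return state

print(discretize_position(0.5, -2.3))       # one of 9 states
print(discretize_position(0.5, -2.3, 1.0))  # one of 27 states
```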
Though the force space could be partitioned in a similar way, an unsupervised clustering scheme is used. In a previous work [20] we pointed out the feasibility of unsupervised learning algorithms, particularly Kohonen's self-organizing maps (SOMs) [21], for extracting feature information from sensor data in robotic manipulation tasks.

An SOM is a lattice of units, or cells. Each unit is a vector with as many components as inputs to the SOM. Though there is a neighborhood relationship between units in the lattice, it is only used during the training of the map and not in our scheme.
SOMs perform a nonlinear projection of the probability density function of the input space onto the two-dimensional lattice of units. Though all six force and torque signals are available, the practical solution adopted is to use only the three torque signals as inputs to the map. The reason for this is the strong correlation between the force and the torque; thus, adding those correlated signals does not add any new information to the system.
The SOM is trained with sensor samples obtained during insertions. After training, each cell or unit of the map becomes a prototype or codebook vector, which represents a region of the input space. The discretized force state is the codebook vector which is nearest (measured by the Euclidean distance) to the analog force values.
The number of units must be chosen a priori, seeking a balance between size and fineness. In the experiments, a 6 × 4 map is used, totalling 24 discrete force states. Since the final state consists of position and force, there are 9 × 24 = 216 discrete states in the cylindrical insertion, and 27 × 24 = 648 discrete states in the non-cylindrical task.
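A sketch of how the trained map is used for discretization: the force state is the index of the nearest codebook vector, which is paired with the position state to give the overall discrete state. The random codebook values and the way the two indices are combined into a single index are illustrative assumptions.

```python
import numpy as np

class ForceSOM:
    """Discretize the three torque readings with a trained 6 x 4 SOM: the
    force state is the index of the codebook vector nearest in Euclidean
    distance.  Training of the codebooks is not shown here."""

    def __init__(self, codebooks):
        self.codebooks = np.asarray(codebooks)  # shape (24, 3)

    def force_state(self, torque):
        d = np.linalg.norm(self.codebooks - np.asarray(torque), axis=1)
        return int(np.argmin(d))

def composite_state(position_state, force_state, n_force_states=24):
    # Any bijective pairing of the two indices yields the 9 x 24 = 216
    # (or 27 x 24 = 648) discrete states; this particular one is assumed.
    return position_state * n_force_states + force_state

som = ForceSOM(np.random.default_rng(1).standard_normal((24, 3)))
print(composite_state(4, som.force_state([0.02, -0.01, 0.005])))
```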
3.4.2 Reinforcement learning
The advantage of the proposed architecture over other random approaches is the ability to learn a relationship between sensed states and actions. As the system becomes skilled, this relationship is more intensely used to guide the process towards completion with the minimum number of steps.
The system must learn without a teacher. The skill measurement is the time or number of steps required to perform a correct insertion, and it is expressed in terms of cost or negative reinforcement.
Sutton [22] defined reinforcement learning (RL) as the learning of a mapping from situations to actions so as to maximize a scalar reward or reinforcement signal.

Q-learning [23] is an RL algorithm that can be used whenever there is no explicit model of the system and the cost structure. This algorithm learns the state–action pairs which maximize a scalar reinforcement signal that will be received over time. In the simplest case, this measure is the sum of the future reinforcement values, and the objective is to learn an associative mapping that at each time step selects, as a function of the current state, an action that maximizes the expected sum of future reinforcement.
In Q-learning, a look-up table of Q-values is stored in memory, one Q-value for each state–action pair. The Q-value is the expected amount of reinforcement if, from that state, the action is performed and, afterwards, only optimal actions are chosen. In our setup, when the system performs any action (motion), a negative constant reinforcement is signalled. This reinforcement represents the cost of the motion. Since the learning algorithm tends to maximize the reinforcement, cost will be minimized, i.e., the system will learn those actions which lead to the goal with the minimum number of steps.
The basic learning step consists in updating a single Q-value. If the system senses state s, performs action a, receives reinforcement r and then senses a new state s′, the Q-value for (s, a) is updated as follows:

    Q(s, a) ← (1 − α) Q(s, a) + α [r + γ max_a′ Q(s′, a′)]

where α is the learning rate and γ is the discount factor.
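A minimal tabular implementation of this update rule; the learning rate and discount factor values are illustrative, not taken from the paper.

```python
import numpy as np

class QTable:
    """Tabular Q-learning over the discrete states and actions above."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95):
        self.q = np.zeros((n_states, n_actions))
        self.alpha = alpha    # learning rate
        self.gamma = gamma    # discount factor

    def update(self, s, a, r, s_next):
        target = r + self.gamma * self.q[s_next].max()
        self.q[s, a] = (1.0 - self.alpha) * self.q[s, a] + self.alpha * target

# Every motion is penalized with the same negative reinforcement, so
# maximizing Q amounts to minimizing the expected number of steps.
table = QTable(n_states=216, n_actions=8)
table.update(s=42, a=3, r=-1.0, s_next=57)
```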
3.4.3 Action selection and exploration
During the learning process, there is a conflict between exploration and exploitation. Initially, the Q-values are meaningless and actions should be chosen randomly, but as learning progresses, better actions should be chosen to minimize the cost of learning. However, exploration cannot be completely turned off, since the optimal action might not yet be discovered.
Some heuristics for exploration and exploitation can be found in the literature. In the implementation, we have chosen Boltzmann exploration: the Q-values are used for weighing exploitation and exploration. The probability of selecting an action a in state s is

    p(s, a) = exp(Q(s, a)/T) / Σ_b exp(Q(s, b)/T)
where T is a positive value which controls the degree of randomness, often referred to as temperature. It gradually decays from an initial value, and exploration is turned off when it is close to zero, since the best action is then selected with probability 1.
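A sketch of Boltzmann action selection with a decaying temperature; the geometric decay schedule is an assumption, since the text only states that T decays gradually from an initial value.

```python
import numpy as np

def boltzmann_action(q_values, temperature, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(Q/T): high T
    gives near-uniform exploration, and as T approaches zero the greedy
    action is chosen with probability close to 1."""
    z = np.asarray(q_values, dtype=float) / max(temperature, 1e-8)
    z -= z.max()                 # numerical stabilization
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

def temperature(step, t0=1.0, decay=0.999):
    """Illustrative geometric decay of the temperature over learning steps."""
    return t0 * decay**step

q_row = np.array([-3.0, -1.5, -2.0, -4.0])
print(boltzmann_action(q_row, temperature(step=500)))
```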
4 Experimental results
The system has been implemented on a robot arm equipped with a wrist-mounted force sensor (Fig. 4). The task is the insertion of pegs of different shapes (circular, square and triangular section) into their appropriate holes. Pegs are made of wood, and the platform containing the holes is made of a synthetic resin.

Uncertainty in the position and orientation is greater than the clearance between the pegs and the holes. The nominal goal is specified by a vector and a rotation matrix relative to an external fixed frame of reference. This location is supposed to be centered above the hole, so the peg would be inserted just by moving straight along the Z axis with no rotation if there were no uncertainty
present. After positioning over the nominal goal, the robot performs a guarded motion towards the hole.
If the insertion fails, the robot starts a series of perception and action cycles. First, sensors are read and a state is identified; depending on that state, one action or another is chosen, and the learning mechanism updates the internal decision parameters. The robot performs compliant motions, i.e., it keeps contact with the surface while moving, so that it can detect the hole by a sudden force change due to the loss of contact.
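Putting the pieces together, one exploration trial can be sketched as the perception–action loop below, including the timeout mentioned next. The helper functions are stand-ins for the sensing, discretization, motion and learning components of Section 3.

```python
import random
import time

# Stand-ins for the real interfaces: reading sensors, running the SOM and
# grid discretization, commanding a compliant step, and updating the Q-table.
def identify_state():       return random.randrange(216)
def choose_action(state):   return random.randrange(8)
def execute_motion(action): pass
def hole_found():           return random.random() < 0.01
def learn(s, a, r, s_next): pass

def exploration_trial(timeout_s=60.0, step_cost=-1.0):
    """Second plan step: explore compliantly around the hole until it is
    found or the trial times out; every motion incurs the same cost."""
    start = time.time()
    s = identify_state()
    while time.time() - start < timeout_s:
        a = choose_action(s)
        execute_motion(a)
        s_next = identify_state()
        learn(s, a, step_cost, s_next)
        if hole_found():
            return True   # proceed to the insertion step
        s = s_next
    return False          # timeout: a new trial is started

print(exploration_trial(timeout_s=0.01))
```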
To avoid long exploration cycles, a timeout is set which stops the process if the hole is not found within that time. In this case a new trial is started.

4.1 Case of the cylindrical peg
The peg is 29 mm in diameter, while the hole is chamferless and 29.15 mm in diameter. The clearance between the peg and the hole is 0.075 mm, thus the clearance ratio is 0.005. The peg has to be inserted to a depth of 10 mm into the hole.

The input space of the self-organizing map is defined by the three filtered torque components. The map has 6 × 4 units and is trained off-line with approximately 70,000 data vectors extracted from previous random trials.

Once the map is trained, the robot performs a sequence of trials, each of which starts at a random position within an uncertainty radius of 3 mm. To ensure absolutely that the goal is within the exploration area, this area is set to a 5 mm square, centered at the real starting position. Exploration motions are tangential to the surface, i.e., along the X and Y dimensions. The exploration space is partitioned into nine regions – limits between regions are −2 and +2 mm away from the initial location for both X and Y. Each of these regions defines a qualitative location state. The state is determined by combining the winner unit of the map and this relative qualitative position with respect to the initial location, thus the total number of states is 24 × 9 = 216.

Fig. 4. Zebra Zero robot arm, grasping a peg over the platform.
Contact is detected simply by thresholds in the force component Fz (normal to the surface). During compliant motions, a force Fz equal to −0.15 kgf is constantly exerted on the surface.
The action space is discretized. Exploratory compliant motions consist of fixed steps in eight different directions of the XY-plane, with some degrees of freedom (X, Y) being position-controlled and the other (Z) being force-controlled. The complexity of the motion is transferred to the control modules, and the learning process is simplified.
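For reference, the eight exploratory directions can be generated as fixed steps around the XY-plane; the step length is an illustrative value, as it is not stated here.

```python
import math

STEP_MM = 0.5  # assumed step length; the paper does not give it here
ACTIONS = [
    (STEP_MM * math.cos(k * math.pi / 4), STEP_MM * math.sin(k * math.pi / 4))
    for k in range(8)
]
print(ACTIONS[0], ACTIONS[2])  # steps along +X and +Y
```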
4.1.1 Learning results
The learning update step consists in modifying the Q-value of the previous state and the performed action according to the reinforcement and the value of the next state. The agent receives a constant negative reinforcement for each action it performs (action-penalty representation). The best policy, the one that maximizes the obtained reinforcement, is the one achieving the goal with the minimum number of actions. Experimental results are shown in Fig. 5. The critical phase is the surface-compliant motion towards the hole. The system must learn to find the hole based on sensory information. The exploration time

Fig. 5. Smoothed insertion time taken on 4000 trials of the cylinder task.