Models of trace decay, eligibility for reinforcement, and delay of reinforcement gradients, from exponential to hyperboloid


Behavioural Processes

journal homepage: www.elsevier.com/locate/behavproc


Peter R. Killeen∗

Department of Psychology, Box 1104, McAllister St., Arizona State University, Tempe, AZ 85287-1104, United States

Article info

Article history:

Received 15 August 2010

Received in revised form 24 December 2010

Accepted 27 December 2010

Keywords:

Delay of reinforcement gradients

Discounting

Forced choice paradigms

Magnitude effect

Matching paradigms

Reinforcement learning

Trace decay gradients

Abstract

Behavior such as depression of a lever or perception of a stimulus may be strengthened by consequent behaviorally significant events (BSEs), such as reinforcers. This is the Law of Effect. As time passes since its emission, the ability for the behavior to be reinforced decreases. This is trace decay. It is upon decayed traces that subsequent BSEs operate. If the trace comes from a response, it constitutes primary reinforcement; if from perception of an extended stimulus, it is classical conditioning. This paper develops simple models of these processes. It premises exponentially decaying traces related to the richness of the environment, and conditioned reinforcement as the average of such traces over the extended stimulus, yielding an almost-hyperbolic function of duration. The models account for some data, and reinforce the theories of other analysts by providing a sufficient account of the provenance of these effects. It leads to a linear relation between sooner and later isopreference delays whose slope depends on sensitivity to reinforcement, and intercept on that and the steepness of the delay gradient. Unlike human prospective judgments, all control is vested in either primary or secondary reinforcement processes; therefore the use of the term discounting, appropriate for humans, may be less descriptive of the behavior of nonverbal organisms.

1 Introduction

Pigeons cannot reliably count above 3 (Brannon et al., 2001; Nickerson, 2009; Uttal, 2008), have short time-horizons (Shettleworth and Plowright, 1989), may be stuck in time (Roberts and Feeney, 2009), do not ask for the answers to the questions they are about to be asked (Roberts et al., 2009), and fail to negotiate an amount of reinforcement commensurate with the work that they are about to undertake (Reilly et al., 2011). How do such simple creatures discount future payoffs as a function of their delay? It is the thesis of this paper that they do not; that the orderly data in such studies are the simple result of the dilution of the conditioned reinforcers which support and guide that choice, as a function of the delay to the outcome that they signal.

Classic and generally accepted concepts of causality preclude events from acting backward in time. Then what sense do we make of Fig. 1, a familiar rendition of the control exerted by delayed reinforcers? How do the animals know what is coming? Only three accounts come to mind. (a) Precognition. But causality rules that out. (b) It is memory of a past choice that makes contact with reinforcement; the figure should be reversed. Or (c) the animals have learned what leads to what. There follows an extended argument that (b) and (c) are both true, and that in novel contexts, (b) typically leads to (c).

∗ Tel.: +1 480 967 0560; fax: +1 480 965 8544. E-mail address: killeen@asu.edu

When in the course of an animal's behavior a behaviorally significant event (BSE; or phylogenetically important event (Baum, 2005); or more familiarly, incentive, reinforcer, or unconditioned stimulus) occurs, there immediately arises the question of whence. In computer science this is the assignment of credit problem. If the organism, or software, takes into account events in the last instant, there are $r$ potential causes for the BSE, where $r$ is a measure of the richness of context. An additional $r$ events occurred in the prior, penultimate instant. The combination of any one of these with those in the ultimate instant could have been the causal chain that led to the BSE: $r^2$ sequences in toto. Extending the account further, to the antepenultimate instant, raises the pool to $r^3$. Continue this process back and the candidate pool of sequences grows as $r^n$, where $n$ is the depth of query. If each of these instants of apprehension lasts $\delta$ s, then $n = d/\delta$, and the candidate pool grows as $r^{d/\delta}$, where $d$ is the delay between event and consequence. In the continuous limit, this equals $e^{d/\tau}$, where $\tau$ is the time constant of the traces – the inverse of the continuous limit of the richness parameter $r$. This means that the gradients get steeper in rich environments: $\tau = 1/r$. It follows that any one causal path is eligible for $1/e^{d/\tau}$ of the credit for reinforcement, everything being equal.
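The growth of the candidate pool can be sketched numerically. The snippet below is only an illustration of the discrete counting argument: the function names, the richness values (2 and 4) and the 1-s instant are hypothetical choices, not values from the paper.

```python
def candidate_paths(r: float, d: float, delta: float) -> float:
    """Number of candidate causal sequences, r**(d/delta), for delay d."""
    return r ** (d / delta)


def credit_per_path(r: float, d: float, delta: float) -> float:
    """Share of credit available to any one path, everything being equal."""
    return 1.0 / candidate_paths(r, d, delta)


if __name__ == "__main__":
    DELTA = 1.0  # assumed duration of one instant of apprehension, in seconds
    for r in (2.0, 4.0):  # a bleaker and a richer context (illustrative)
        shares = [credit_per_path(r, d, DELTA) for d in range(4)]
        print(f"r = {r}: credit at delays 0..3 s ->", [round(s, 4) for s in shares])
```

Under these assumptions the share of credit falls geometrically with delay, and falls faster in the richer context.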

Of course everything is not equal: the priors on some events are higher than on others, either because of their phylogenetic relevance, or their memorability, which may be enhanced by marking their occurrence with salient stimuli. Allowing for such bias, represented by the parameter $c$, we would expect the causal impact to decrease with time prior as:

$$s'_d = c\,e^{-d/\tau} \tag{1}$$

which is the point association¹ of an event at $d$ seconds remove from the BSE, as seen in Fig. 2.

Fig. 1. Traditional delay of reinforcement gradients to two outcomes of different incentive value.

Fig. 2. Reverse Fig. 1 to see these, variously named trace decay, decay of eligibility for causal status, and decay of memory gradients. Gradients are shown for two responses of different memorability.

doi: 10.1016/j.beproc.2010.12.016

This story for why associability between an event and a subsequent BSE may decay exponentially, retold from Johansen et al. (2009) and Killeen (2005), has some empirical support (Killeen, 2001b; Escobar and Bruner, 2007). Eligibility traces play a central role in AI reinforcement learning (Singh and Sutton, 1996). Classic models such as Sutton and Barto's posit a geometrically decreasing representation of events similar to that developed here, and work to reconcile details of instrumental and Pavlovian conditioning with various instantiations of such traces (Sutton and Barto, 1990; Niv et al., 2002). Alternatively, it is possible to simply posit exponential or hyperbolic decay of memory of the stimulus, and also that these traces may or may not vary with the richness of the environment. This has been the productive tactic of most analysts of delay discounting. If this disposition is good enough for you, skip the next 3 pages.

What is the purported mechanism? As developed here it is one of stimulus competition, with richer environments and greater interludes providing more opportunities for interference. A stimulus-sampling model of acquisition (Bower, 1994; Estes and Suppes, 1974; Neimark and Estes, 1967; Estes, 1950) provides the basis of a model of acquisition in the face of such contingencies degraded by delay and distraction (Killeen, 2001a). It is not repeated here. Another way to think of Eq. (1) is as a measure of the signal-to-noise ratio of a delay contingency. In the case $c = 1/\tau$, Eq. (1) describes a probability distribution, so that identification of one point from the distribution reduces candidate uncertainty by $\log_2(e\tau)$ bits.

¹ If the duration of a response is $\delta$ s, then the impact of reinforcement on it is given by the integral of Eq. (1) from $d$ to $d + \delta$. For brief events such as responses, this essentially equals $\delta$ times the right-hand side of Eq. (1). For responses of similar durations this coefficient is absorbed by $c$.

Fig. 3. Eligibility traces of a response at increasing temporal removes from a reinforcer. At greater removes, the right tails have lower associability with reinforcement, as indicated by their height where they intersect the right ordinate. Graphing that height above the temporal distance gives the dashed curve, the delay of reinforcement gradient.

What is the relation between eligibility traces and the delay of reinforcement gradient? Fig. 3 shows 7 trace gradients for events occurring more and more remote from the BSE. The most proximate occurs at the moment of reinforcement, and is visible only as a dot in the upper right corner; it receives the full credit for which it might be eligible. An event occurring 1 time step earlier has an impact diluted by about 30% by the time of reinforcement, as inferred from where its trace cuts the origin, the zero delay axis at the right of the graph. Draw this measure of eligibility, 0.7, out 1 unit from the right frame, as shown by the arrow, and connect it to the full measure in the corner by a dashed line. The event 2 steps back decays by about 50% at the time of reinforcement; draw a line from there extending to the left at 2, and continue the dashed line to it. When bored with this construction, stop to consider the shape of the delay of reinforcement gradient – the dashed line. When smoothed, it will have exactly the same shape as any of the decay traces, but will be reflected about its new origin at 0.
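The graphical construction can be mimicked in a few lines. This is a minimal sketch assuming an exponential trace $c\,e^{-d/\tau}$ with $c = 1$; the time constant is chosen here (an assumption, not a fitted value) so that a trace retains 0.7 of its strength after one step, matching the "about 30%" dilution read off the figure.

```python
import math


def trace(c: float, tau: float, elapsed: float) -> float:
    """Residual strength of an exponentially decaying trace after `elapsed` time."""
    return c * math.exp(-elapsed / tau)


C = 1.0
# Assumed time constant: picked so that one time step leaves 0.7 of the trace.
TAU = 1.0 / math.log(1.0 / 0.7)

# Height of each trace where it meets the moment of reinforcement, for events
# occurring 0, 1, ..., 6 steps earlier: this is the delay-of-reinforcement gradient.
gradient = [trace(C, TAU, steps_back) for steps_back in range(7)]

if __name__ == "__main__":
    for steps_back, height in enumerate(gradient):
        print(steps_back, round(height, 3))
```

With this choice the event two steps back retains 0.49 of its strength, in line with the "about 50%" quoted above, and the gradient has the same exponential form as each individual trace.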

The distinction between these two representations, one of process and the other of product, is important. As Fig. 2 makes clear, what is present at the time of reinforcement is a decayed trace of a response. Differential reinforcer magnitude can have no retroactive effect on the shape or elevation of those traces. Reinforcers of different magnitudes do not change the decay gradients, but rather act differentially on their tails: a larger reinforcer may be more effective at leveraging the same residual memory than a small one. But those tails may be of different elevation – and thus differentially able to receive the effect of the reinforcement – because they are more or less memorable (reflected in $c$) or because they occur in a richer or bleaker environment (reflected in $\tau$).

2 Hyperbolic dilemmas

How can gradients be exponential when everyone says that they are hyperbolic? The curves in Fig. 1 do not cross, whereas most representations of discounted future events of differing value do. These three figures address the associability of a discrete event at a remove of $t$ from reinforcement. They do not address situations in which that event leads to an immediate change of state signaling a deferred outcome. A signal of change of state marks the precipitating event by immediately singling it out as the precursor of a better (or possibly worse) state of affairs. Consider a response that causes the onset of a stimulus, and after a delay of $d$, a BSE. Assume that each of the temporal elements of the stimulus receives associations as given by Eq. (1), and that these are otherwise equivalent in time (that is, that the parameters of Eq. (1) do not change over the delay). In the case that each element of the stimulus is highly generalizable with the next, these associations add linearly, giving a total associability equal to $\int_0^d c\,e^{-t/\tau}\,dt$. This integral assumes that the temporal elements $dt$ make linearly independent contributions to the total association. Because one element of the stimulus is, per hypothesis, indiscriminable from the next, any one element – in particular the one just following a response – has an average associability given by:

$$\bar{s}_d = \frac{\int_0^d c\,e^{-t/\tau}\,dt}{\int_0^d dt}$$

$$\bar{s}_d = \frac{c\,\tau\,(1 - e^{-d/\tau})}{d} \tag{2}$$

Fig. 4. Disks: the decreasing efficacy of a primary reinforcer as a function of the delay between it and a response. The continuous curve is given by Eq. (1); the dashed curve by Eq. (3). Squares: the decreasing efficacy of a conditioned reinforcer as a function of the maximum delay it signals. The continuous curve is given by Eq. (2); the dashed curve superimposed on it by Eq. (3). The data are from Richards (1981).
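As a check on the averaging step, the closed form $c\,\tau\,(1 - e^{-d/\tau})/d$ of Eq. (2) can be compared with a direct numerical average of the exponential trace over the stimulus. This is a sketch under stated assumptions: the function names are hypothetical and the parameter values ($c = 1$, $\tau = 2$ s, $d = 5$ s) are arbitrary illustrations.

```python
import math


def point_assoc(c: float, tau: float, t: float) -> float:
    """Eq. (1): association of an element at remove t from the BSE."""
    return c * math.exp(-t / tau)


def mean_assoc_closed(c: float, tau: float, d: float) -> float:
    """Eq. (2) in closed form; returns the limiting value c when d == 0."""
    if d == 0:
        return c
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return c * tau * -math.expm1(-d / tau) / d


def mean_assoc_numeric(c: float, tau: float, d: float, steps: int = 10_000) -> float:
    """Average of Eq. (1) over [0, d], computed with the midpoint rule."""
    h = d / steps
    total = sum(point_assoc(c, tau, (i + 0.5) * h) for i in range(steps)) * h
    return total / d


if __name__ == "__main__":
    c, tau, d = 1.0, 2.0, 5.0  # illustrative values only
    print(mean_assoc_closed(c, tau, d), mean_assoc_numeric(c, tau, d))
```

The two routes should agree to within the integration error, and the closed form approaches $c$ as the stimulus duration shrinks toward zero.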

Eq. (2) is not discriminable from the inverse linear relation known as hyperbolic (Killeen, 2001a). Fig. 4 demonstrates this similarity by fitting both Eq. (2), and the hyperbola

$$s_{hyp} = \frac{c}{1 + d/\tau} \tag{3}$$

to data from Richards (1981) that describe the effects of signaled delayed reinforcement on the average response rates of four pigeons. The curves through the squares superimpose. This makes sense, as Eq. (3) is a series approximation² to Eq. (2).

Experienced laboratory animals can tell the difference between the start of a long delay and the start of a short one; they are sensitive to time and delay (Moore and Fantino, 1975). The use of Eq. (2) requires that, facing the start of a long delay to food and a stimulus which – in the best of times – is contiguous with food, control by the stimulus dominates that by time. Animals, in other words, are optimists: their behavior is primarily under the control of the most hopeful stimuli rather than some weighted average of predictive stimuli. There is good evidence that this is often the case (Horney and Fantino, 1984; Sanabria and Killeen, 2007; Jenkins and Boakes, 1973).

² The approximation follows from

$$e^{-d/\tau} = \frac{1}{e^{d/\tau}} \approx \frac{1}{1 + d/\tau + \cdots}$$

$$\therefore\ \bar{s}_d = \frac{c\,\tau\,(1 - e^{-d/\tau})}{d} \approx \frac{c\,\tau}{d}\left(1 - \frac{1}{1 + d/\tau}\right)$$

$$\bar{s}_d \approx \frac{c}{1 + d/\tau}$$

The average absolute deviation between Eqs. (2) and (3) over the range from 0.99 to 0.04 is 0.064; however, letting the time constant in either equation vary from its value in the other reduces this deviation to 0.023, within experimental error. The exponential term may also be approximated with the more standard Maclaurin series: $e^{-d/\tau} = 1 - d/\tau + (d/\tau)^2/2! - \cdots$, but the first approximation is everywhere more accurate. The latter approximation deviates from Eq. (2) by 4.6 (against 0.06), reduced to 0.33 (against 0.02) by refitting $\tau$. The limit of Eq. (2) as $d$ goes to 0 is $c$, as may be demonstrated using l'Hôpital's rule.

Fig. 5. The decreasing efficacy of a reinforcer in establishing a new response as a function of the delay between it and a response. The continuous curve is exponential, the dashed curve hyperbolic. Error bars are the standard errors of the means. The data are from Wilkenfield et al. (1992).
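The closeness of the averaged exponential, $c\,\tau\,(1 - e^{-d/\tau})/d$, to the hyperbola, $c/(1 + d/\tau)$, can be explored numerically. The sketch below is illustrative only: the delay grid (0 to 20 s), the time constant (2 s) and the crude grid search used to refit the hyperbola's time constant are all assumptions made here, so the deviations it prints are not expected to reproduce the values quoted in the footnote.

```python
import math


def mean_assoc(c: float, tau: float, d: float) -> float:
    """Averaged exponential trace (Eq. (2)); equals c in the limit d -> 0."""
    if d == 0:
        return c
    return c * tau * -math.expm1(-d / tau) / d


def hyperbola(c: float, tau: float, d: float) -> float:
    """Hyperbolic gradient (Eq. (3))."""
    return c / (1.0 + d / tau)


def mean_abs_dev(tau_exp: float, tau_hyp: float, delays, c: float = 1.0) -> float:
    """Average absolute deviation between the two curves over `delays`."""
    devs = [abs(mean_assoc(c, tau_exp, d) - hyperbola(c, tau_hyp, d)) for d in delays]
    return sum(devs) / len(devs)


TAU = 2.0                                # assumed time constant, in seconds
DELAYS = [0.5 * i for i in range(41)]    # assumed delay grid, 0 to 20 s

dev_same = mean_abs_dev(TAU, TAU, DELAYS)

# Let the hyperbola's time constant vary on its own (simple grid search).
candidates = [TAU * (1 + 0.01 * k) for k in range(301)]
best_tau = min(candidates, key=lambda t: mean_abs_dev(TAU, t, DELAYS))
dev_refit = mean_abs_dev(TAU, best_tau, DELAYS)

if __name__ == "__main__":
    print("same tau:", round(dev_same, 4))
    print("refit tau:", round(best_tau, 2), "deviation:", round(dev_refit, 4))
```

With a shared time constant the hyperbola lies on or below the averaged exponential at every delay, which is why allowing its time constant to grow brings the two curves closer together.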

Also shown in Fig. 4 is the decay trace for unsignaled reinforcement. Under the hypothesis of the prior section, it is given by Eq. (1), an exponential function, shown as the continuous curve passing near the disks, which show response rates for unsignaled (non-resetting) delays. Also shown is the hyperbola, Eq. (3), which apparently gives an inferior fit to these data – although this database is too limited to make secure generalizations. For unsignaled delayed reinforcement, at least in this case, the exponential gradients are, as predicted, competitive with the more traditional hyperbolic gradients. Fig. 4 illustrates Lattal's generalization that “The unsignaled delay gradient is characterized by [generally] lower response rates and a steeper slope than the gradient obtained with otherwise equivalent signaled delays” (Lattal, 2010).

Whereas Fig. 4 usefully compares the effects of signaled and unsignaled delays, because the unsignaled delays were non-resetting, the actually experienced delays were variable and less, by an unspecified amount, than the abscissae. A better test of the sufficiency of Eq. (1) comes from Wilkenfield et al. (1992), using resetting delays, where the abscissae provide accurate representations of the experienced delays. These investigators reported the response rates during acquisition of lever pressing from four groups of rats, nine in each group. Their data from the first 100 min of acquisition are shown in Fig. 5. Again, the exponential provides a plausible model.

The simple hyperbolic model has been shown adequate for most discount functions for non-verbal animals (Green and Myerson, 2004; Ong and White, 2004; Green et al., 2004). But unlike its cousin the hyperbola, which is ad hoc, Eq. (2) has some theoretical motivation: it predicts radical changes in preference as a function of the nature and continuity of the stimuli that bridge the delay between response and BSE, and holds out the promise for quantifying those effects. It is consistent with the important role of conditioned reinforcers in preference for delayed outcomes (Williams and Dunn, 1991), and provides a useful refinement to a unified theory of choice (Killeen and Fantino, 1990). In the latter theory, and its precedent (Killeen, 1982a,b), the control by a delayed reinforcer was modeled as the sum of both the primary (i.e., point association with the

the association of streams of responses with reinforcement, is the heart of the model of coupling in my theory of schedule effects, MPR (Killeen, 1994).

The presence of stimuli occurring between a response and BSE may not always be beneficial to conditioning the response. Brief stimuli occurring immediately after a response (marking it) may make the response more memorable when the BSE occurs (Lieberman et al., 1979; Thomas et al., 1983) – perhaps by increasing the value of $c$. Alternatively, such stimuli may initiate adjunctive behavior that serves as an extended conditioned stimulus (CS) (Schaal and Branch, 1988). Conversely, brief stimuli occurring just before reinforcement may block control by the response–reinforcer association (Pearce and Hall, 1978). Williams (1999) and Reed and Doughty (2005) demonstrated the power of both effects in the same experiments. Whether the effects of primary and secondary reinforcement add or interfere depends on the correlation of each of the contingencies with the behavior measured by the experimenter: a CS whose presentation is not contingent on behavior will only adventitiously strengthen the target response, and, depending on temporal variables, is as likely to compete with it; furthermore, one which signals non-contingent reinforcement will compete with concurrent instrumental responses (Miczek and Grossman, 1971). A CS presented on the instrumental operandum can enhance response rate, whereas one presented on a different operandum can compete with it (Schwartz, 1976). As the duration of a marking stimulus extends into the delay interval, integration of Eq. (2) between its endpoints predicts a positively accelerating effectiveness of the stimulus. Schaal and Branch (1990) found the predicted increase, but it was negatively accelerated for 2 of the 3 pigeons. The association of a CS or response with the measured behavior will also depend on the modality of the CS, the modality of the response (Timberlake and Lucas, 1990), and the contingencies that make the correlation tight or weak (Killeen and Bizo, 1998). For the present argument, these correlations of response and CS with the experimenter's dependent variable are carried by the constant $c$.

3 The effects of delay on choice

To apply Eq. (2) to experiments in which an animal is choosing between delayed reinforcers of different magnitudes ($a$) requires a scale that maps amount into reinforcing effectiveness. Perhaps the simplest “utility” function for reinforcement amount is the power function, which is the form assumed in the generalized matching law (Rachlin, 1971; Killeen, 1972; Baum, 1979). It has the advantage of simplicity, and fits most of the available data over its limited range. A disadvantage is that it has the effectiveness of reinforcement growing without bound as the amount is increased, which is implausible. Rachlin has derived other forms for utility from first principles (1992); his logarithmic, and my (1985) exponential-integral can also accommodate data, as can Bradshaw and associates' hyperbolic discounting of amount (Bezzina et al., 2007). However, the equations look simpler if we adopt the formalism of the generalized matching law in which the reinforcing power of amount is the power function, $u(a) = a^{\alpha}$. Then the associative strength of a response immediately followed by a stimulus change, and $d$ later a BSE of physical magnitude $a$, is the product of the impact of the BSE, $a^{\alpha}$, on the sum of the primary $s'_d$ and secondary $\bar{s}_d$ effects. Assuming for parsimony that in the cases analysed the relative salience of stimulus elements and responses are comparable, then $c_{primary} \approx c_{secondary} = c$, and:

$$s_{d,a} = a^{\alpha} c \left[ e^{-d/\tau} + \frac{\tau\,(1 - e^{-d/\tau})}{d} \right] \tag{4}$$

Fig. 6. Data from an experiment by Green and associates (2004) in which the amount delivered to pigeons immediately (1/2 s delay) was adjusted to indifference with that given after the delay noted on the x axis. The parameter is the magnitude of the delayed reinforcer. The curves are drawn by Eqs. (5) and (6).

3.1 Methods of adjustment

Psychophysical paradigms in which variables are adjusted to cause indifference in preferences or other judgments – “Matching paradigms” (Farell and Pelli, 1999) – are more secure of interpretation than those involving a psychological scale, such as one of value (Hand, 2004; Uttal, 2000). Their units are physical measurements, and they refer to a unique psychological point, that of equivalence. This may be determined whether the underlying scale is interval, ordinal, or even nominal.

How great must an amount $a_1$ be to balance a different amount $a_2$ at a different delay? Set

$$a_1^{\alpha} c \left[ e^{-d_1/\tau} + \frac{\tau\,(1 - e^{-d_1/\tau})}{d_1} \right] = a_2^{\alpha} c \left[ e^{-d_2/\tau} + \frac{\tau\,(1 - e^{-d_2/\tau})}{d_2} \right]$$

and solve for $a_1$:

$$\frac{a_1}{a_2} = \left[ \frac{e^{-d_2/\tau} + \tau\,(1 - e^{-d_2/\tau})/d_2}{e^{-d_1/\tau} + \tau\,(1 - e^{-d_1/\tau})/d_1} \right]^{1/\alpha} \tag{5}$$

Eq. (5) gives the relative equivalent value of amount $a_2$ delayed $d_2$, compared to an alternative delayed $d_1$. Typically, $d_1$ is “immediate” – that is, around 1/2 s, and then Eq. (5) gives the relative immediate equivalent amount. With $d_2 > d_1$, this ratio will be less than 1, indicating that a smaller immediate amount, relative to $a_2$, suffices to balance the latter at a remove of $d_2$. Note that neither amount appears in the right hand side; no magnitude effect is predicted: as long as the ratio of delays is the same, the predictions are the same when both amounts are multiplied by a constant. In general, no magnitude effect is found in delay discounting experiments with non-human animals (Green et al., 2004; Ong and White, 2004). Fig. 6 shows the course of Eq. (5), with $\alpha = 1.26$ and $\tau = 2.12$ s, passing near the average data from four pigeons in an experiment where the amount delivered after 1/2 s was adjusted to maintain indifference between it and a larger amount (given by the parameter in the figure) delivered at a delay.
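The relative immediate equivalent of Eq. (5) is easy to tabulate. The following is a sketch, not the author's fitting code: it uses the parameter values the text reports for Fig. 6 ($\alpha = 1.26$, $\tau = 2.12$ s) and takes "immediate" as 0.5 s; the function names and the list of delays are hypothetical.

```python
import math

ALPHA = 1.26  # sensitivity to amount, as reported in the text for Fig. 6
TAU = 2.12    # time constant in seconds, as reported in the text for Fig. 6


def strength_kernel(d: float, tau: float) -> float:
    """Bracketed term of Eq. (4): primary plus secondary effect at delay d > 0."""
    return math.exp(-d / tau) + tau * -math.expm1(-d / tau) / d


def equivalent_ratio(d1: float, d2: float, alpha: float = ALPHA, tau: float = TAU) -> float:
    """Eq. (5): the ratio a1/a2 at which the two alternatives are equally strong."""
    return (strength_kernel(d2, tau) / strength_kernel(d1, tau)) ** (1.0 / alpha)


if __name__ == "__main__":
    for d2 in (1, 2, 4, 8, 16):  # illustrative delays, in seconds
        print(d2, round(equivalent_ratio(0.5, d2), 3))
```

Because the amounts never enter `equivalent_ratio`, doubling both $a_1$ and $a_2$ leaves the prediction unchanged, which is the absence of a magnitude effect noted above.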

The primary and conditioned reinforcing effects are highly correlated; Eq. (5) may be simplified by deleting the primary influence of the reinforcers on the choice responses, to yield:

$$\frac{a_1}{a_2} = \left[ \frac{d_1\,(1 - e^{-d_2/\tau})}{d_2\,(1 - e^{-d_1/\tau})} \right]^{1/\alpha} \tag{6}$$

which draws the continuous curve through the data in Fig. 6. But the primary and secondary effects may be dissociated, and when they are, alternatives with both are preferred to those with just primary reinforcement (Marcattilio and Richards, 1981; Lattal, 1984). The hyperbolic approximation to Eq. (6) provides a decent fit to these data as well, but falls noticeably farther from their average than do Eqs. (5) and (6).

Fig. 7. Data from experiment 2 of Fox et al. (2008) studying relative choice of 3 pellets delayed vs. 1 immediate in two strains of rats, with curves drawn by Eq. (8). Here “immediate” is set at 1/2 s.
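The effect of dropping the primary term can be seen by computing both forms side by side. This is a hedged sketch: the function names are hypothetical, and the shared parameter values ($\alpha = 1.26$, $\tau = 2.12$ s, immediate = 0.5 s) are an assumption made for illustration. The simplified form is not expected to match the full form numerically under identical parameters; in a fit, each equation would take its own values.

```python
import math


def full_ratio(d1: float, d2: float, alpha: float, tau: float) -> float:
    """Eq. (5): a1/a2 with both primary and secondary effects."""
    def kernel(d: float) -> float:
        return math.exp(-d / tau) + tau * -math.expm1(-d / tau) / d
    return (kernel(d2) / kernel(d1)) ** (1.0 / alpha)


def secondary_only_ratio(d1: float, d2: float, alpha: float, tau: float) -> float:
    """Eq. (6): a1/a2 with the primary (point-association) term deleted."""
    num = d1 * -math.expm1(-d2 / tau)
    den = d2 * -math.expm1(-d1 / tau)
    return (num / den) ** (1.0 / alpha)


if __name__ == "__main__":
    alpha, tau, d1 = 1.26, 2.12, 0.5  # shared values, for illustration only
    for d2 in (1, 2, 4, 8, 16):
        print(d2,
              round(full_ratio(d1, d2, alpha, tau), 3),
              round(secondary_only_ratio(d1, d2, alpha, tau), 3))
```

Both forms fall below 1 as the delay to the larger amount grows; with shared parameters the secondary-only form declines less steeply.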

In some matching experiments, the delay to one outcome is adjusted, rather than the amount. Eq. (6) yields no simple prediction, but invoking the series expansion of the exponential term² that was used in going from Eq. (2) to Eq. (3):

$$\bar{s}_{d_i,a_i} \approx \frac{a_i^{\alpha}\,c}{1 + d_i/\tau},$$

leads to the simple linear relation of Eq. (7):

$$d_2 = d_1 \left(\frac{a_2}{a_1}\right)^{\alpha} + \tau \left[ \left(\frac{a_2}{a_1}\right)^{\alpha} - 1 \right] \tag{7}$$

Operations that increase the sensitivity to reinforcement (increase $\alpha$) or flatten the gradient (increase $\tau$) will increase the indifference point, $d_2$. The provenance of the effect can be determined by manipulating $d_1$, as the former will increase both slope and intercept, and the latter only intercept. Some drugs, such as stimulants, may decrease $\alpha$ while increasing $\tau$ (Maguire et al., 2009; Pitts and Febbo, 2004), and their results will thus vary as a function of the balance between the two, largely determined by the value of $d_1$. A linear equation such as (7), based on multiplicative hyperbolic functions of amount and delay, was proposed and validated by Mazur (2001), and independently by Bradshaw's group (Ho et al., 1999; Bezzina et al., 2007; da Costa Araújo et al., 2009). In Bradshaw's model, as in Eq. (7), the slope depends on relative payoffs regulated by the amount amplifier parameter $\alpha$, and the intercept on a multiplicative function of that and delay sensitivity. Their model has also been applied to human delay discounting (Hinvest and Anderson, 2010; Liang et al., 2010).
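The linear relation can be checked against the hyperbolic strengths from which it is derived. The sketch below assumes the form $d_2 = d_1 k + \tau (k - 1)$ with $k = (a_2/a_1)^{\alpha}$, which follows from equating $a_i^{\alpha} c/(1 + d_i/\tau)$ for the two alternatives; the function names and parameter values are hypothetical illustrations.

```python
def indifference_delay(d1: float, a1: float, a2: float, alpha: float, tau: float) -> float:
    """Delay d2 at which amount a2 balances amount a1 delayed by d1 (Eq. (7))."""
    k = (a2 / a1) ** alpha
    return d1 * k + tau * (k - 1.0)


def hyperbolic_strength(a: float, d: float, alpha: float, tau: float, c: float = 1.0) -> float:
    """Hyperbolic approximation to associative strength: a**alpha * c / (1 + d/tau)."""
    return a ** alpha * c / (1.0 + d / tau)


if __name__ == "__main__":
    a1, a2, d1 = 1.0, 3.0, 0.5  # e.g. 1 pellet after 0.5 s vs. 3 pellets later
    for alpha, tau in ((1.0, 2.0), (1.2, 2.0), (1.0, 3.0)):
        slope = (a2 / a1) ** alpha
        intercept = tau * (slope - 1.0)
        d2 = indifference_delay(d1, a1, a2, alpha, tau)
        print(f"alpha={alpha}, tau={tau}: slope={slope:.3f}, "
              f"intercept={intercept:.3f}, d2={d2:.3f}")
```

Raising `alpha` changes both slope and intercept, whereas raising `tau` moves the intercept alone, which is the dissociation the text proposes to exploit by manipulating $d_1$.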

3.2 Methods of forced choice

An alternative psychophysical procedure involves the measurement of the degree of preference between two fixed alternatives, or the frequency of choosing one over the other. Eq. (4) may be rearranged to predict the outcome of choice experiments in which the delays and outcomes are invariant. The relative associative strength of the alternatives is:

$$\frac{s_{d_1,a_1}}{s_{d_1,a_1} + s_{d_2,a_2}} = \left[ 1 + \left(\frac{a_2}{a_1}\right)^{\alpha} \frac{e^{-d_2/\tau} + \tau\,(1 - e^{-d_2/\tau})/d_2}{e^{-d_1/\tau} + \tau\,(1 - e^{-d_1/\tau})/d_1} \right]^{-1} \tag{8}$$

In the case of unbiased choice there are two free parameters, the rate of diminishing marginal utility for larger amounts, $\alpha$, and the time constant of the memory trace, $\tau$. Note that amounts again appear as a ratio, indicating scale invariance: there is no magnitude effect. Fig. 7 shows this model follows a path similar to the data of Fox et al. (2008), who asked whether rat models of ADHD (SHRs) would show steeper delay gradients than control (WKY) rats. They did. Other investigators (Adriani et al., 2003) did not find steeper gradients for SHR, but observed very large individual differences. As noted by Orduña et al. (2007), the main effect found by Fox and associates may be due to idiosyncrasies of their control rats (Sagvolden et al., 2009).
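A relative-strength prediction of the kind in Eq. (8) can be sketched as follows. The function names and the parameter values ($\alpha = 1$, $\tau = 2$ s) are hypothetical, chosen only to show the shape of the prediction for 1 pellet after 0.5 s versus 3 pellets after a longer delay; they are not the values fitted to the data of Fig. 7.

```python
import math


def strength_kernel(d: float, tau: float) -> float:
    """Primary plus secondary effect at delay d > 0 (bracketed term of Eq. (4))."""
    return math.exp(-d / tau) + tau * -math.expm1(-d / tau) / d


def choice_proportion(d1: float, a1: float, d2: float, a2: float,
                      alpha: float, tau: float) -> float:
    """Eq. (8): relative associative strength of alternative 1."""
    odds_against = (a2 / a1) ** alpha * strength_kernel(d2, tau) / strength_kernel(d1, tau)
    return 1.0 / (1.0 + odds_against)


if __name__ == "__main__":
    alpha, tau = 1.0, 2.0  # assumed values, for illustration
    for d2 in (0.5, 2, 4, 8, 16):
        p_small = choice_proportion(0.5, 1.0, d2, 3.0, alpha, tau)
        print(d2, round(1.0 - p_small, 3))  # predicted preference for the 3-pellet option
```

Multiplying both amounts by the same constant leaves `choice_proportion` unchanged, since only their ratio enters the expression.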

4 Discussion

Prospective judgments of equivalent amounts by humans, typical in the delay-discounting literature, require computations that are different in kind from those of paradigms in which real delays are conditioned to discriminative stimuli. Humans can be instructed to contemplate the desirability of ten thousand dollars in ten years, and to stipulate how little they would settle for one week hence in lieu of it. The performance entails a scale of future time, the value of an outcome deferred by that delay, and concatenation of the non-linear time-scale with a non-linear amount scale, from which a variety of results are imaginable (Killeen, 2009; Rachlin, 2006). Little wonder that there are differences in covering models. The only way to so instruct other animals is to expose them to such realities repeatedly. The assertion in the opening of this paper that the future cannot act on non-verbal animals was meant to emphasize this difference: on the one hand verbally presented unexperienced hypotheticals that can control human responses, and on the other the conditioning of behavior reinforced by the presentation of conditioned reinforcers signaling real, experienced, delays, that controls pigeon and rat behavior.

This paper should be read as a grounding of hyperbolic models of delay discounting, not a critique of them. It presented a few ideas. First, it is observed that Fig. 1 is not a model of a process. It is a summary of some other kind of process, such as the one proposed in Fig. 3. The distinction is important, as thinking of Fig. 1 as a process can be misleading. I am not alone in this concern:

In this [Fig. 1] view, reinforcers reach back in time to effect this response in the presence of the remembered stimulus. As a model of how an animal adapts to, or learns about, situations with stimulus–behavior delays and response–reinforcer delays, the model has the problem of reinforcer effects spreading backward in time. Physiologically, the process cannot act in this way, and physiology must require that the memory of an event flows forward in time, rather than the reinforcer effect flowing backwards. But the response-centric view is the dominant view in the study of delayed reinforcers and of self control.

A simpler, much more likely, and physiologically consistent conceptualization of the adaptation to these delays is shown in [Fig. 2]. In this view, at the point at which a reinforcer is delivered, it is the conjunction of the memories of both the stimulus and the response at the time of reinforcer delivery that is “strengthened” and, I presume, remembered and subsequently accessed and used. This approach suggests a different, and more parsimonious, mechanism for learning and activity that is squarely based on memory. When reinforcers are delayed, it is the residual memory of responses times the value of the reinforcers that will describe the effects of reinforcer delay on behavior. When responses are delayed following stimuli, it is the residual memory of the stimulus times the value of the reinforcer that will describe the stimulus–reinforcer conjunction, providing a role for stimulus–reinforcer relations (as in momentum theory) (Davison, 2006).

The present paper constitutes simply the endorsement of the first paragraph and one realization of the second paragraph.


colleague felt that such grounding is unnecessary, as the hyperbola is justified by its ubiquitous accuracy in characterizing the ‘discounting’ data, and that the rationale supporting hyperbolic discounting does not rely on the validity or even plausibility of any internal mechanism. Rather, it relies on its predictive ability on its own level, the overt behavior of the whole organism, and its applicability in the real world. So, why all the above talk about associations, decaying traces, and assignment of credit? Because, I plead, it puts some meat on the bones, holds out a hand of translation to AI reinforcement theorists, and turns ‘round a figure got backward. But, chacun à son goût.

A third idea expressed in this paper is that simple processes of decay (Eq. (1)) and average decay (Eq. (2)) represent behavioral processes that are void of cognitive representations. That is not the case for human delay discounting, as the vast majority (though not all) of the data from it involves hypothetical amounts and delays that are communicated verbally, and have never and will never be experienced by the individual. The present treatment is thoroughly behavioral. The use of mathematics to represent the conditioning processes has been misunderstood by some colleagues as asserting that the animals must perform such computations. That is true in the same sense that a rope suspended at two points evaluates a catenary equation. The calculations of pigeon and rope, such as they are, are embodied, not computed; the mathematical representation derives from the scientist, not from the thing he or she uses it to describe.

The final idea is the importance of the distinction between different mensuration paradigms. The matching paradigm, some of whose results are displayed in Fig. 6, is different in kind than the forced-choice/preference paradigm, some of whose results are displayed in Fig. 7. Why should an animal who prefers alternative A to alternative B not always choose A; but rather choose it, say, only 70% of the time? It does not suffice to say “because it matches”, which offers a result in the guise of an explanation. To decline the thing you prefer, you must have balancing considerations, such as cost, or novelty; or be confused; or be irrational. Mazur (2010) has shown that in the simple forced choice paradigm non-exclusive preference may be due to experimental designs that confuse the animal. That possibility is exacerbated in the concurrent chain version of the forced-choice paradigm. Sub-exclusive preference there occurs not because the other 30% of the time the animal prefers B (how often would you choose $30 over $70, once the pleasure of thwarting the experimenter has paled?) – but because the contingencies of reinforcement have made the probability of getting B sufficiently greater at that point in time, primed and awaiting collection, with the preferred A never any closer.³

The way in which probabilities on concurrent schedules bend preference from rational exclusivity toward matching was nicely demonstrated by Crowley and Donahoe (2004). But these evolving probabilities are typically treated as externals, measured (e.g., Boutros et al., 2009; Davison and Baum, 2003), analysed (MacDonall, 2000, 2005) and modeled (e.g., Grace et al., 2006) in their own right. Unfortunately, that research seldom changes the interpretation of relative rates as prima facie measures of preference. The dynamically evolving probabilities that concurrent VIs schedule are an intrinsic part of the package the animal must dynamically balance, not a neutral tool to measure it. When the negative feedback inherent in those schedules is eliminated in adjustment paradigms where confusion is minimized, animals just about always choose

³ On random-interval VI schedules with mean $m$, the probability of reinforcement on the same key one second after the last peck is always $1/m$, whereas on the other key it increases toward 1 as $1 - e^{-t/m}$, with $t$ the time since the last changeover.
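The contrast drawn in footnote 3 can be made concrete with a short numerical sketch. It is an illustration only, not part of the original analysis: the function names and the 60-s mean are assumptions, and the schedule is idealized as a constant-probability (exponential) arming process, as the footnote's formula implies.

```python
import math


def p_same_key(m: float) -> float:
    """Footnote's approximation: chance that the key being worked has
    armed in the 1 s since the last peck, for a schedule with mean m (s)."""
    return 1.0 / m


def p_other_key(t: float, m: float) -> float:
    """Chance that the unattended key has armed by t s after the last
    changeover: 1 - exp(-t/m), which rises toward 1 as t grows."""
    return 1.0 - math.exp(-t / m)


if __name__ == "__main__":
    m = 60.0  # hypothetical mean interval in seconds (an assumed value)
    print(f"same key, each second: {p_same_key(m):.4f}")
    for t in (1, 5, 15, 30, 60, 120):
        print(f"other key after {t:>3} s away: {p_other_key(t, m):.4f}")
```

Under these assumed values, by $t = 5$ s the probability on the unattended key already exceeds the constant $1/m$ on the attended key, and it keeps growing. This is the sense in which the less-preferred alternative becomes "primed and awaiting collection".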

Magnitude, delay, and probability of reinforcement interact to control choice in concurrent schedules (Elliffe et al., 2008). Some interaction is allowed by Eq. (8) due to its many nonlinearities, giving more weight to delay differentials as both delays increase. But Ito and Asaki (1982) found substantial monotonic increases in rats’ preference for 3 vs 1 pellets as the equal delays to their receipt increased. Ong and White (2004) noted other instances of this effect, and attributed it to increased sensitivity to reinforcer amount when reinforcers are delayed. But it is not clear how that is anything other than a magnitude effect; and thus at odds with the results from matching (adjustment) paradigms.

Whether due to discrimination failure in simple forced choice, or negative feedback contingencies in concurrent chain schedules, non-exclusive preferences are an uncertain metric of what animals value. The application of Eqs. (5)–(7) for matching paradigms is therefore offered with more confidence than Eq. (8) for concurrent-chain interval schedules, which require a more complex model, such as that of Christensen and Grace (2010).

Acknowledgements

I thank Tim Cheung and Ryan Brackney for comments, Robert Kessel for insisting on mathematical precision, and Tony Nevin for insisting on conceptual clarity as well; and I thank them all for helping to show me how to achieve those desiderata. The remaining significant deviations are mine.

References

Adriani, W., Caprioli, A., Granstrem, O., Carli, M., Laviola, G., 2003. The spontaneously hypertensive-rat as an animal model of ADHD: evidence for impulsive and non-impulsive subpopulations. Neurosci. Biobehav. Rev. 27, 639–651.

Baum, W.M., 1979. Matching, undermatching, and overmatching in studies of choice. J. Exp. Anal. Behav. 32, 269–281.

Baum, W.M., 2005. Understanding Behaviorism: Behavior, Culture, and Evolution. Blackwell, Malden, MA, p. 312.

Bezzina, G., Cheung, T.H.C., Asgari, K., Hampson, C.L., Body, S., Bradshaw, C.M., Szabadi, E., Deakin, J.F.W., Anderson, I.M., 2007. Effects of quinolinic acid-induced lesions of the nucleus accumbens core on inter-temporal choice: a quantitative analysis. Psychopharmacology 195, 71–84.

Boutros, N., Elliffe, D., Davison, M., 2009. Time versus response indices affect conclusions about preference pulses. Behav. Processes 84, 450–454.

Bower, G.H., 1994. A turning point in mathematical learning theory. Psychol. Rev. 101, 290–300.

Brannon, E.M., Wusthoff, C.J., Gallistel, C.R., Gibbon, J., 2001. Numerical subtraction in the pigeon: evidence for a linear subjective number scale. Psychol. Sci. 12, 238–243.

Christensen, D.R., Grace, R.C., 2010. A decision model for steady-state choice in concurrent chains. J. Exp. Anal. Behav. 94, 227–240.

Crowley, M.A., Donahoe, J.W., 2004. Matching: its acquisition and generalization. J. Exp. Anal. Behav. 82, 143–159.

da Costa Araújo, S., Body, S., Hampson, C.L., Langley, R.W., Deakin, J.F.W., Anderson, I.M., Bradshaw, C.M., Szabadi, E., 2009. Effects of lesions of the nucleus accumbens core on inter-temporal choice: further observations with an adjusting-delay procedure. Behav. Brain Res. 202, 272–277.

Davison, M., 2006. Behavior-centric versus reinforcer-centric descriptions of behavior. PsyCrit 12 (November), 1–3.

Davison, M., Baum, W.M., 2003. Every reinforcer counts: reinforcer magnitude and local preference. J. Exp. Anal. Behav. 80, 95–129.

Elliffe, D., Davison, M., Landon, J., 2008. Relative reinforcer rates and magnitudes do not control concurrent choice independently. J. Exp. Anal. Behav. 90, 169–185.

Escobar, R., Bruner, C.A., 2007. Response induction during the acquisition and maintenance of lever pressing with delayed reinforcement. J. Exp. Anal. Behav. 88, 29–49.

Estes, W.K., 1950. Toward a statistical theory of learning. Psychol. Rev. 57, 94–107.

Estes, W.K., Suppes, P., 1974. Foundations of stimulus sampling theory. In: Contemporary Developments in Mathematical Psychology.

Farell, B., Pelli, D.G., 1999. Psychophysical methods, or how to measure a threshold and why. In: Carpenter, R.H.S., Robson, J.G. (Eds.), Vision Research: A Practical Guide to Laboratory Methods. Oxford Univ. Press, New York.

Fox, A.T., Hand, D.J., Reilly, M.P., 2008. Impulsive choice in a rodent model of attention-deficit/hyperactivity disorder. Behav. Brain Res. 187, 146–152.

Grace, R.C., Berg, M.E., Kyonka, E.G.E., 2006. Choice and timing in concurrent chains: effects of initial-link duration. Behav. Processes 71, 188–200.


Green, L., Myerson, J., 2004. A discounting framework for choice with delayed and probabilistic rewards. Psychol. Bull. 130, 769–792.

Green, L., Myerson, J., Holt, D.D., Slevin, J.R., Estle, S.J., 2004. Discounting of delayed food rewards in pigeons and rats: is there a magnitude effect? J. Exp. Anal. Behav. 81, 39–50.

Hand, D.J., 2004. Measurement Theory and Practice. Oxford University Press, Inc., New York, p. 320.

Hinvest, N.S., Anderson, I.M., 2010. The effects of real versus hypothetical reward on delay and probability discounting. Q. J. Exp. Psychol. 63, 1072–1084.

Ho, M.Y., Mobini, S., Chiang, T.J., Bradshaw, C.M., Szabadi, E., 1999. Theory and method in the quantitative analysis of “impulsive choice” behaviour: implications for psychopharmacology. Psychopharmacology 146, 362–372.

Horney, J., Fantino, E., 1984. Choice for conditioned reinforcers in the signaled absence of primary reinforcement. J. Exp. Anal. Behav. 41, 193–201.

Ito, M., Asaki, K., 1982. Choice behavior of rats in a concurrent-chains schedule: amount and delay of reinforcement. J. Exp. Anal. Behav. 37, 383–392.

Jenkins, H.M., Boakes, R.A., 1973. Observing stimulus sources that signal food or no food. J. Exp. Anal. Behav. 20, 197–207.

Johansen, E.B., Killeen, P.R., Russell, V.A., Tripp, G., Wickens, J.R., Tannock, R., Williams, J., Sagvolden, T., 2009. Origins of altered reinforcement effects in ADHD. Behav. Brain Funct. 5, 7.

Killeen, P.R., 1972. The matching law. J. Exp. Anal. Behav. 17, 489–495.

Killeen, P.R., 1982a. Incentive theory. In: Bernstein, D.J. (Ed.), Nebraska Symposium on Motivation, vol. 1981. Response Structure and Organization, University of Nebraska Press, Lincoln.

Killeen, P.R., 1982b. Incentive theory II: models for choice. J. Exp. Anal. Behav. 38, 217–232.

Killeen, P.R., 1985. Incentive theory IV: magnitude of reward. J. Exp. Anal. Behav. 43, 407–417.

Killeen, P.R., 1994. Mathematical principles of reinforcement. Behav. Brain Sci. 17, 105–172.

Killeen, P.R., 2001a. Modeling games from the 20th century. Behav. Processes 54, 33–52.

Killeen, P.R., 2001b. Writing and overwriting short-term memory. Psychon. Bull. Rev. 8, 18–43.

Killeen, P.R., 2005. Gradus ad parnassum: ascending strength gradients or descending memory traces? Behav. Brain Sci. 28, 432–434.

Killeen, P.R., 2009. An additive-utility model of delay discounting. Psychol. Rev. 116, 602–619.

Killeen, P.R., Bizo, L.A., 1998. The mechanics of reinforcement. Psychon. Bull. Rev., 221–238.

Killeen, P.R., Fantino, E., 1990. A unified theory of choice. J. Exp. Anal. Behav. 53, 189–200.

Lattal, K.A., 1984. Signal functions in delayed reinforcement. J. Exp. Anal. Behav. 42, 239–253.

Lattal, K.A., 2010. Delayed reinforcement of operant behavior. J. Exp. Anal. Behav. 93, 129–139.

Liang, C.H., Ho, M.Y., Yang, Y.Y., Tsai, C.T., 2010. Testing the applicability of a multiplicative hyperbolic model of inter-temporal and risky choice in human volunteers. Chin. J. Psychol. 52, 189–204.

Lieberman, D.A., McIntosh, D.C., Thomas, G.V., 1979. Learning when reward is delayed: a marking hypothesis. J. Exp. Psychol. Anim. Behav. Process. 5, 224–242.

MacDonall, J.S., 2000. Synthesizing concurrent interval performances. J. Exp. Anal. Behav. 74, 189–206.

MacDonall, J.S., 2005. Earning and obtaining reinforcers under concurrent interval scheduling. J. Exp. Anal. Behav. 84, 167–183.

Maguire, D.R., Rodewald, A.M., Hughes, C.E., Pitts, R.C., 2009. Rapid acquisition of preference in concurrent schedules: effects of D-amphetamine on sensitivity to reinforcement amount. Behav. Processes 81, 238–243.

Marcattilio, A.J.M., Richards, R.W., 1981. Preference for signaled versus unsignaled reinforcement delay in concurrent-chain schedules. J. Exp. Anal. Behav. 36, 221–229.

Mazur, J.E., 2001. Hyperbolic value addition and general models of animal choice. Psychol. Rev. 108, 96–112.

Mazur, J.E., 2010. Distributed versus exclusive preference in discrete-trial choice. J. Exp. Psychol. Anim. Behav. Process. 36, 321–333.

Miczek, K.A., Grossman, S.P., 1971. Positive conditioned suppression: effects of CS duration. J. Exp. Anal. Behav. 15, 243–247.

Moore, J., Fantino, E., 1975. Choice and response contingencies. J. Exp. Anal. Behav. 23, 339–347.

Neimark, E.D., Estes, W.K., 1967. Stimulus Sampling Theory. Holden-Day, San Francisco.

Nickerson, R.S., 2009. Mathematical Reasoning: Patterns, Problems, Conjectures, and Proofs. Psychology Press, London.

Niv, Y., Joel, D., Meilijson, I., Ruppin, E., 2002. Evolution of reinforcement learning in uncertain environments: a simple explanation for complex foraging behaviors. Adapt. Behav. 10, 5–24.

Ong, E.L., White, K.G., 2004. Amount-dependent temporal discounting? Behav. Processes 66, 201–212.

Orduña, V., Hong, E., Bouzas, A., 2007. Interval bisection in spontaneously hypertensive rats. Behav. Processes 74, 107–111.

Pearce, J.M., Hall, G., 1978. Overshadowing the instrumental conditioning of a lever-press response by a more valid predictor of the reinforcer. J. Exp. Psychol. Anim. Behav. Process. 4, 356–367.

Pitts, R.C., Febbo, S.M., 2004. Quantitative analyses of methamphetamine’s effects on self-control choices: implications for elucidating behavioral mechanisms of drug action. Behav. Processes 66, 213–233.

Rachlin, H., 1971. On the tautology of the matching law. J. Exp. Anal. Behav. 15, 249–251.

Rachlin, H., 1992. Diminishing marginal value as delay discounting. J. Exp. Anal. Behav. 57, 407–415.

Rachlin, H., 2006. Notes on discounting. J. Exp. Anal. Behav. 85, 425–435.

Reed, P., Doughty, A.H., 2005. Within-subject testing of the signaled-reinforcement effect on operant responding as measured by response rate and resistance to change. J. Exp. Anal. Behav. 83, 31–45.

Reilly, M.P., Posadas-Sanchez, D., Kettle, L.C., Killeen, P.R., 2011. Making the trip worthwhile: do rats (Rattus norvegicus) and pigeons (Columba livia) forage prospectively? Behav. Processes, in review.

Richards, R.W., 1981. A comparison of signaled and unsignaled delay of reinforcement. J. Exp. Anal. Behav. 35, 145–152.

Roberts, W.A., Feeney, M.C., 2009. The comparative study of mental time travel. Trends Cogn. Sci. 13, 271–277.

Roberts, W.A., Feeney, M.C., McMillan, N., MacPherson, K., Musolino, E., Petter, M., 2009. Do pigeons (Columba livia) study for a test? J. Exp. Psychol. Anim. Behav. Process. 35, 129–142.

Sagvolden, T., Johansen, E.B., Wøien, G., Walaas, S.I., Storm-Mathisen, J., Bergersen, L.H., Hvalby, Ø., Jensen, V., Aase, H., Russell, V.A., Killeen, P.R., DasBanerjee, T., Middleton, F.A., Faraone, S.V., 2009. The spontaneously hypertensive rat model of ADHD – the importance of selecting the appropriate reference strain. Neuropharmacology 57, 619–626.

Sanabria, F., Killeen, P.R., 2007. Temporal generalization accounts for response resurgence in the peak procedure. Behav. Processes 74, 126–141.

Schaal, D.W., Branch, M.N., 1988. Responding of pigeons under variable-interval schedules of unsignaled, briefly signaled, and completely signaled delays to reinforcement. J. Exp. Anal. Behav. 50, 33–54.

Schaal, D.W., Branch, M.N., 1990. Responding of pigeons under variable-interval schedules of signaled-delayed reinforcement: effects of delay-signal duration. J. Exp. Anal. Behav. 53, 103–121.

Schwartz, B., 1976. Positive and negative conditioned suppression in the pigeon: effects of the locus and modality of the CS. Learn. Motiv. 7, 86–100.

Shettleworth, S.J., Plowright, C., 1989. Time horizons of pigeons on a two-armed bandit. Anim. Behav. 37, 610–623.

Singh, S.P., Sutton, R.S., 1996. Reinforcement learning with replacing eligibility traces. Mach. Learn. 22, 123–158.

Sutton, R.S., Barto, A.G., 1990. Time-derivative models of Pavlovian reinforcement. In: Gabriel, M., Moore, J. (Eds.), Learning and Computational Neuroscience: Foundations of Adaptive Networks. MIT Press, Cambridge, MA.

Thomas, G.V., Lieberman, D.A., McIntosh, D.C., Ronaldson, P., 1983. The role of marking when reward is delayed. J. Exp. Psychol. Anim. Behav. Process. 9, 401–411.

Timberlake, W., Lucas, G.A., 1990. Behavior systems and learning: from misbehavior to general principles. In: Klein, S.B., Mowrer, R.R. (Eds.), Contemporary Learning Theories: Instrumental Conditioning Theory and the Impact of Constraints on Learning. Erlbaum, Hillsdale, NJ.

Uttal, W.R., 2000. The War Between Mentalism and Behaviorism: On the Accessibility of Mental Processes. Lawrence Erlbaum Associates, Inc., Mahwah, NJ.

Uttal, W.R., 2008. Time, Space, and Number in Physics and Psychology. Sloan Publishing, Cornwall-on-Hudson, NY.

Wilkenfield, J., Nickel, M., Blakely, E., Poling, A., 1992. Acquisition of lever-press responding in rats with delayed reinforcement: a comparison of three procedures. J. Exp. Anal. Behav. 58, 431–443.

Williams, B.A., 1999. Associative competition in operant conditioning: blocking the response–reinforcer association. Psychon. Bull. Rev. 6, 618–623.

Williams, B.A., Dunn, R., 1991. Preference for conditioned reinforcement. J. Exp. Anal. Behav. 55, 37–46.
