Behavioural Processes
journal homepage: www.elsevier.com/locate/behavproc
Models of trace decay, eligibility for reinforcement, and delay of reinforcement gradients, from exponential to hyperboloid
Peter R. Killeen∗
Department of Psychology, Box 1104, McAllister St., Arizona State University, Tempe, AZ 85287-1104, United States
Article info
Article history:
Received 15 August 2010
Received in revised form 24 December 2010
Accepted 27 December 2010
Keywords:
Delay of reinforcement gradients
Discounting
Forced choice paradigms
Magnitude effect
Matching paradigms
Reinforcement learning
Trace decay gradients
Abstract
Behavior such as depression of a lever or perception of a stimulus may be strengthened by consequent behaviorally significant events (BSEs), such as reinforcers. This is the Law of Effect. As time passes since its emission, the ability for the behavior to be reinforced decreases. This is trace decay. It is upon decayed traces that subsequent BSEs operate. If the trace comes from a response, it constitutes primary reinforcement; if from perception of an extended stimulus, it is classical conditioning. This paper develops simple models of these processes. It premises exponentially decaying traces related to the richness of the environment, and conditioned reinforcement as the average of such traces over the extended stimulus, yielding an almost-hyperbolic function of duration. The models account for some data, and reinforce the theories of other analysts by providing a sufficient account of the provenance of these effects. They lead to a linear relation between sooner and later isopreference delays whose slope depends on sensitivity to reinforcement, and intercept on that and the steepness of the delay gradient. Unlike human prospective judgments, all control is vested in either primary or secondary reinforcement processes; therefore the use of the term discounting, appropriate for humans, may be less descriptive of the behavior of nonverbal organisms.
© 2011 Elsevier B.V. All rights reserved.
1 Introduction
Pigeons cannot reliably count above 3 (Brannon et al., 2001; Nickerson, 2009; Uttal, 2008), have short time-horizons (Shettleworth and Plowright, 1989), may be stuck in time (Roberts and Feeney, 2009), do not ask for the answers to the questions they are about to be asked (Roberts et al., 2009), and fail to negotiate an amount of reinforcement commensurate with the work that they are about to undertake (Reilly et al., 2011). How do such simple creatures discount future payoffs as a function of their delay? It is the thesis of this paper that they do not; that the orderly data in such studies are the simple result of the dilution of the conditioned reinforcers which support and guide that choice, as a function of the delay to the outcome that they signal.
Classic and generally accepted concepts of causality preclude events from acting backward in time. Then what sense do we make of Fig. 1, a familiar rendition of the control exerted by delayed reinforcers? How do the animals know what is coming? Only three accounts come to mind. (a) Precognition. But causality rules that out. (b) It is memory of a past choice that makes contact with reinforcement; the figure should be reversed. Or (c) the animals have learned what leads to what. There follows an extended argument that (b) and (c) are both true, and that in novel contexts, (b) typically leads to (c).

∗ Tel.: +1 480 967 0560; fax: +1 480 965 8544.
E-mail address: killeen@asu.edu
When in the course of an animal's behavior a behaviorally significant event (BSE; or phylogenetically important event (Baum, 2005); or more familiarly, incentive, reinforcer, or unconditioned stimulus) occurs, there immediately arises the question of whence. In computer science this is the assignment of credit problem. If the organism, or software, takes into account events in the last instant, there are $r$ potential causes for the BSE, where $r$ is a measure of the richness of context. An additional $r$ events occurred in the prior, penultimate instant. The combination of any one of these with those in the ultimate instant could have been the causal chain that led to the BSE: $r^2$ sequences in toto. Extending the account further, to the antepenultimate instant, raises the pool to $r^3$. Continue this process back and the candidate pool of sequences grows as $r^n$, where $n$ is the depth of query. If each of these instants of apprehension lasts $\delta$ s, then $n = d/\delta$, and the candidate pool grows as $r^{d/\delta}$, where $d$ is the delay between event and consequence. In the continuous limit, this equals $e^{d/\tau}$, where $\tau$ is the time constant of the traces – the inverse of the continuous limit of the richness parameter $r$. This means that the gradients get steeper in rich environments: $\tau = 1/r$. It follows that any one causal path is eligible for $1/e^{d/\tau}$ of the credit for reinforcement, everything being equal.
Of course everything is not equal: the priors on some events are higher than on others, either because of their phylogenetic relevance, or their memorability, which may be enhanced by marking their occurrence with salient stimuli. Allowing for such bias, represented by the parameter $c$, we would expect the causal impact to decrease with time prior as:

$$s'_d = c\,e^{-d/\tau} \tag{1}$$

which is the point association¹ of an event at $d$ seconds remove from the BSE, as seen in Fig. 2.

Fig. 1. Traditional delay of reinforcement gradients to two outcomes of different incentive value.

Fig. 2. Reverse Fig. 1 to see these, variously named trace decay, decay of eligibility for causal status, and decay of memory gradients. Gradients are shown for two responses of different memorability.

doi: 10.1016/j.beproc.2010.12.016
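Reading Eq. (1) as the exponential trace c·exp(−d/τ), a minimal Python sketch of the point association might look as follows. The function name and the default values of `c` and `tau` are illustrative assumptions, not values fitted in the paper.

```python
import math

def point_association(d, c=1.0, tau=2.0):
    """Eligibility of an event that occurred d seconds before the BSE.

    Implements s'_d = c * exp(-d / tau). The defaults for c (bias) and
    tau (time constant, in seconds) are placeholders for illustration.
    """
    if d < 0:
        raise ValueError("delay d must be non-negative")
    if tau <= 0:
        raise ValueError("time constant tau must be positive")
    return c * math.exp(-d / tau)

if __name__ == "__main__":
    # A shorter time constant (richer environment) gives a steeper gradient.
    for tau in (1.0, 2.0, 4.0):
        row = [round(point_association(d, tau=tau), 3) for d in (0, 1, 2, 4, 8)]
        print(f"tau={tau}: {row}")
```

An event at a delay of one time constant retains 1/e of its initial eligibility, whatever the value of `c`.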
This story for why associability between an event and a subsequent BSE may decay exponentially, retold from Johansen et al. (2009) and Killeen (2005), has some empirical support (Killeen, 2001b; Escobar and Bruner, 2007). Eligibility traces play a central role in AI reinforcement learning (Singh and Sutton, 1996). Classic models such as Sutton and Barto's posit a geometrically decreasing representation of events similar to that developed here, and work to reconcile details of instrumental and Pavlovian conditioning with various instantiations of such traces (Sutton and Barto, 1990; Niv et al., 2002). Alternatively, it is possible to simply posit exponential or hyperbolic decay of memory of the stimulus, and also that these traces may or may not vary with the richness of the environment. This has been the productive tactic of most analysts of delay discounting. If this disposition is good enough for you, skip the next 3 pages.
What is the purported mechanism? As developed here it is one of stimulus competition, with richer environments and greater interludes providing more opportunities for interference. A stimulus-sampling model of acquisition (Bower, 1994; Estes and Suppes, 1974; Neimark and Estes, 1967; Estes, 1950) provides the basis of a model of acquisition in the face of such contingencies degraded by delay and distraction (Killeen, 2001a). It is not repeated here. Another way to think of Eq. (1) is as a measure of the signal-to-noise ratio of a delay contingency. In the case $c = 1/\tau$, Eq. (1) describes a probability distribution, so that identification of one point from the distribution reduces candidate uncertainty by $\log_2(e\tau)$ bits.

¹ If the duration of a response is $\delta$ s, then the impact of reinforcement on it is given by the integral of Eq. (1) from $d$ to $d + \delta$. For brief events such as responses, this essentially equals $\delta$ times the right-hand side of Eq. (1). For responses of similar durations this coefficient is absorbed by $c$.

Fig. 3. Eligibility traces of a response at increasing temporal removes from a reinforcer. At greater removes, the right tails have lower associability with reinforcement, as indicated by their height where they intersect the right ordinate. Graphing that height above the temporal distance gives the dashed curve, the delay of reinforcement gradient.
What is the relation between eligibility traces and the delay of reinforcement gradient? Fig. 3 shows 7 trace gradients for events occurring more and more remote from the BSE. The most proximate occurs at the moment of reinforcement, and is visible only as a dot in the upper right corner; it receives the full credit for which it might be eligible. An event occurring 1 time step earlier has an impact diluted by about 30% by the time of reinforcement, as inferred from where its trace cuts the origin, the zero delay axis at the right of the graph. Draw this measure of eligibility, 0.7, out 1 unit from the right frame, as shown by the arrow, and connect it to the full measure in the corner by a dashed line. The event 2 steps back decays by about 50% at the time of reinforcement; draw a line from there extending to the left at 2, and continue the dashed line to it. When bored of this construction, stop to consider the shape of the delay of reinforcement gradient – the dashed line. When smoothed, it will have exactly the same shape as any of the decay traces, but will be reflected about its new origin at 0.
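The construction described for Fig. 3 can be sketched numerically. The sketch assumes exponential traces of unit height and infers the time constant from the roughly 30% dilution after one time step; the names `TAU` and `height_at_reinforcement` are illustrative, not taken from the paper.

```python
import math

# Assumption: a trace that has lost about 30% after 1 time step implies
# exp(-1 / TAU) = 0.7, so TAU = -1 / ln(0.7), roughly 2.8 time steps.
TAU = -1.0 / math.log(0.7)

def height_at_reinforcement(steps_back, tau=TAU, c=1.0):
    """Height of a trace, begun steps_back before the BSE, when the BSE occurs."""
    return c * math.exp(-steps_back / tau)

# One point per trace, as for the 7 traces in the figure: plotting each
# height against its temporal distance traces out the delay gradient.
gradient = [(k, height_at_reinforcement(k)) for k in range(7)]

if __name__ == "__main__":
    for steps_back, height in gradient:
        print(f"{steps_back} steps back -> eligibility {height:.2f}")
```

Under this assumed time constant the event 2 steps back retains 0.49 of its eligibility, in line with the "about 50%" read from the figure, and the listed heights fall on a single exponential, which is the point of the construction.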
The distinction between these two representations, one of process and the other of product, is important. As Fig. 2 makes clear, what is present at the time of reinforcement is a decayed trace of a response. Differential reinforcer magnitude can have no retroactive effect on the shape or elevation of those traces. Reinforcers of different magnitudes do not change the decay gradients, but rather act differentially on their tails: a larger reinforcer may be more effective at leveraging the same residual memory than a small one. But those tails may be of different elevation – and thus differentially able to receive the effect of the reinforcement – because they are more or less memorable (reflected in $c$) or because they occur in a richer or bleaker environment (reflected in $\tau$).
2 Hyperbolic dilemmas
How can gradients be exponential when everyone says that they are hyperbolic? The curves in Fig. 1 do not cross, whereas most representations of discounted future events of differing value do. These three figures address the associability of a discrete event at a remove of $t$ from reinforcement. They do not address situations in which that event leads to an immediate change of state signaling a deferred outcome. A signal of change of state marks the precipitating event by immediately singling it out as the precursor of a better (or possibly worse) state of affairs. Consider a response that causes the onset of a stimulus, and after a delay of $d$, a BSE. Assume that each of the temporal elements of the stimulus receives associations as given by Eq. (1), and that these are otherwise equivalent in time (that is, that the parameters of Eq. (1) do not change over the delay). In the case that each element of the stimulus is highly generalizable with the next, these associations add linearly, giving a total associability equal to $\int_0^d c\,e^{-t/\tau}\,dt$. This integral assumes that the temporal elements $dt$ make linearly independent contributions to the total association. Because one element of the stimulus is, per hypothesis, indiscriminable from the next, any one element – in particular the one just following a response – has an average associability given by:

$$\bar{s}_d = \frac{\int_0^d c\,e^{-t/\tau}\,dt}{\int_0^d dt} = \frac{c\,\tau\,(1 - e^{-d/\tau})}{d} \tag{2}$$

Fig. 4. Disks: the decreasing efficacy of a primary reinforcer as a function of the delay between it and a response. The continuous curve is given by Eq. (1); the dashed curve by Eq. (3). Squares: the decreasing efficacy of a conditioned reinforcer as a function of the maximum delay it signals. The continuous curve is given by Eq. (2); the dashed curve superimposed on it by Eq. (3). The data are from Richards (1981).
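Eq. (2), read as the mean of the exponential trace c·exp(−t/τ) over the stimulus duration d, can be checked against a direct numerical average. This is a sketch under that reading; the function names, default parameter values, and the midpoint-rule step count are assumptions made for illustration.

```python
import math

def mean_association(d, c=1.0, tau=2.0):
    """Closed form c * tau * (1 - exp(-d / tau)) / d, with limit c as d -> 0."""
    if d < 0 or tau <= 0:
        raise ValueError("need d >= 0 and tau > 0")
    if d == 0:
        return c
    # -expm1(-x) equals 1 - exp(-x) but is more accurate for small x.
    return c * tau * -math.expm1(-d / tau) / d

def mean_association_numeric(d, c=1.0, tau=2.0, n=10_000):
    """Midpoint-rule average of the trace over the interval (0, d), for d > 0."""
    dt = d / n
    area = sum(c * math.exp(-(i + 0.5) * dt / tau) for i in range(n)) * dt
    return area / d

if __name__ == "__main__":
    for d in (0.5, 2.0, 8.0):
        print(d, mean_association(d), mean_association_numeric(d))
```

The two columns should agree to several decimal places, and the closed form approaches `c` as the delay shrinks toward zero.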
Eq. (2) is not discriminable from the inverse linear relation known as hyperbolic (Killeen, 2001a). Fig. 4 demonstrates this similarity by fitting both Eq. (2), and the hyperbola

$$s_{\text{hyp}} = \frac{c}{1 + d/\tau} \tag{3}$$

to data from Richards (1981) that describe the effects of signaled delayed reinforcement on the average response rates of four pigeons. The curves through the squares superimpose. This makes sense, as Eq. (3) is a series approximation² to Eq. (2).
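To see how closely the two forms agree, the sketch below evaluates the averaged exponential c·τ·(1 − exp(−d/τ))/d alongside the hyperbola c/(1 + d/τ), using the same `c` and `tau` in both. The parameter values and function names are illustrative assumptions, not the fitted values behind Fig. 4.

```python
import math

def averaged_exponential(d, c=1.0, tau=2.0):
    """Mean of an exponential trace over a stimulus of duration d."""
    return c if d == 0 else c * tau * (1.0 - math.exp(-d / tau)) / d

def hyperbola(d, c=1.0, tau=2.0):
    """Inverse linear (hyperbolic) gradient with the same parameters."""
    return c / (1.0 + d / tau)

def max_abs_gap(delays, c=1.0, tau=2.0):
    """Largest absolute difference between the two forms over the given delays."""
    return max(abs(averaged_exponential(d, c, tau) - hyperbola(d, c, tau))
               for d in delays)

if __name__ == "__main__":
    for d in (0, 1, 2, 4, 8, 16):
        print(d, round(averaged_exponential(d), 3), round(hyperbola(d), 3))
```

With a shared time constant the hyperbola never exceeds the averaged exponential, because exp(x) ≥ 1 + x; allowing the two time constants to differ, as a fit to data would, narrows the gap further.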
Experienced laboratory animals can tell the difference between the start of a long delay and the start of a short one; they are sensitive to time and delay (Moore and Fantino, 1975). The use of Eq. (2) requires that, facing the start of a long delay to food and a stimulus which – in the best of times – is contiguous with food, control by the stimulus dominates that by time. Animals, in other words, are optimists: their behavior is primarily under the control of the most hopeful stimuli rather than some weighted average of predictive stimuli. There is good evidence that this is often the case (Horney and Fantino, 1984; Sanabria and Killeen, 2007; Jenkins and Boakes, 1973).

² $$e^{-d/\tau} = \frac{1}{e^{d/\tau}} \approx \frac{1}{1 + d/\tau + \cdots}$$
$$\therefore\ \bar{s}_d = \frac{c\,\tau\,(1 - e^{-d/\tau})}{d} \approx \frac{c\,\tau}{d}\left(1 - \frac{1}{1 + d/\tau}\right)$$
$$\bar{s}_d \approx \frac{c}{1 + d/\tau}$$
The average absolute deviation between Eqs. (2) and (3) over the range from 0.99 to 0.04 is 0.064; however, letting the time constant in either equation vary from its value in the other reduces this deviation to 0.023, within experimental error.
The exponential term may also be approximated with the more standard Maclaurin series: $e^{-d/\tau} = 1 - d/\tau + (d/\tau)^2/2! - \cdots$, but the first approximation is everywhere more accurate. The latter approximation deviates from Eq. (2) by 4.6 (against 0.06), reduced to 0.33 (against 0.02) by refitting $\tau$.
The limit of Eq. (2) as $d$ goes to 0 is $c$, as may be demonstrated using l'Hôpital's rule.

Fig. 5. The decreasing efficacy of a reinforcer in establishing a new response as a function of the delay between it and a response. The continuous curve is exponential, the dashed curve hyperbolic. Error bars are the standard errors of the means. The data are from Wilkenfield et al. (1992).
Also shown in Fig. 4 is the decay trace for unsignaled reinforcement. Under the hypothesis of the prior section, it is given by Eq. (1), an exponential function, shown as the continuous curve passing near the disks, which show response rates for unsignaled (non-resetting) delays. Also shown is the hyperbola, Eq. (3), which apparently gives an inferior fit to these data – although this database is too limited to make secure generalizations. For unsignaled delayed reinforcement, at least in this case, the exponential gradients are, as predicted, competitive with the more traditional hyperbolic gradients. Fig. 4 illustrates Lattal's generalization that “The unsignaled delay gradient is characterized by [generally] lower response rates and a steeper slope than the gradient obtained with otherwise equivalent signaled delays” (Lattal, 2010).
Whereas Fig. 4 usefully compares the effects of signaled and unsignaled delays, because the unsignaled delays were non-resetting, the actually experienced delays were variable and less, by an unspecified amount, than the abscissae. A better test of the sufficiency of Eq. (1) comes from Wilkenfield et al. (1992), using resetting delays, where the abscissae provide accurate representations of the experienced delays. These investigators reported the response rates during acquisition of lever pressing from four groups of rats, nine in each group. Their data from the first 100 min of acquisition are shown in Fig. 5. Again, the exponential provides a plausible model.
The simple hyperbolic model has been shown adequate for most discount functions for non-verbal animals (Green and Myerson, 2004; Ong and White, 2004; Green et al., 2004). But unlike its cousin the hyperbola, which is ad hoc, Eq. (2) has some theoretical motivation: it predicts radical changes in preference as a function of the nature and continuity of the stimuli that bridge the delay between response and BSE, and holds out the promise for quantifying those effects. It is consistent with the important role of conditioned reinforcers in preference for delayed outcomes (Williams and Dunn, 1991), and provides a useful refinement to a unified theory of choice (Killeen and Fantino, 1990). In the latter theory, and its precedent (Killeen, 1982a,b), the control by a delayed reinforcer was modeled as the sum of both the primary (i.e., point association with the
the association of streams of responses with reinforcement, is the heart of the model of coupling in my theory of schedule effects, MPR (Killeen, 1994).
The presence of stimuli occurring between a response and BSE may not always be beneficial to conditioning the response. Brief stimuli occurring immediately after a response (marking it) may make the response more memorable when the BSE occurs (Lieberman et al., 1979; Thomas et al., 1983) – perhaps by increasing the value of $c$. Alternatively, such stimuli may initiate adjunctive behavior that serves as an extended conditioned stimulus (CS) (Schaal and Branch, 1988). Conversely, brief stimuli occurring just before reinforcement may block control by the response–reinforcer association (Pearce and Hall, 1978). Williams (1999) and Reed and Doughty (2005) demonstrated the power of both effects in the same experiments. Whether the effects of primary and secondary reinforcement add or interfere depends on the correlation of each of the contingencies with the behavior measured by the experimenter: a CS whose presentation is not contingent on behavior will only adventitiously strengthen the target response, and, depending on temporal variables, is as likely to compete with it; furthermore, one which signals non-contingent reinforcement will compete with concurrent instrumental responses (Miczek and Grossman, 1971). A CS presented on the instrumental operandum can enhance response rate, whereas one presented on a different operandum can compete with it (Schwartz, 1976). As the duration of a marking stimulus extends into the delay interval, integration of Eq. (2) between its endpoints predicts a positively accelerating effectiveness of the stimulus. Schaal and Branch (1990) found the predicted increase, but it was negatively accelerated for 2 of the 3 pigeons.
The association of a CS or response with the measured behavior will also depend on the modality of the CS, the modality of the response (Timberlake and Lucas, 1990), and the contingencies that make the correlation tight or weak (Killeen and Bizo, 1998). For the present argument, these correlations of response and CS with the experimenter's dependent variable are carried by the constant $c$.
3 The effects of delay on choice
To apply Eq. (2) to experiments in which an animal is choosing between delayed reinforcers of different magnitudes ($a$) requires a scale that maps amount into reinforcing effectiveness. Perhaps the simplest “utility” function for reinforcement amount is the power function, which is the form assumed in the generalized matching law (Rachlin, 1971; Killeen, 1972; Baum, 1979). It has the advantage of simplicity, and fits most of the available data over its limited range. A disadvantage is that it has the effectiveness of reinforcement growing without bound as the amount is increased, which is implausible. Rachlin has derived other forms for utility from first principles (1992); his logarithmic, and my (1985) exponential-integral can also accommodate data, as can Bradshaw and associates' hyperbolic discounting of amount (Bezzina et al., 2007). However, the equations look simpler if we adopt the formalism of the generalized matching law in which the reinforcing power of amount is the power function, $u(a) = a^{\alpha}$. Then the associative strength of a response immediately followed by a stimulus change, and $d$ later a BSE of physical magnitude $a$, is the product of the impact of the BSE, $a^{\alpha}$, and the sum of the primary $s'_d$ and secondary $\bar{s}_d$ effects. Assuming for parsimony that in the cases analysed the relative salience of stimulus elements and responses are comparable, then $c_{\text{primary}} \approx c_{\text{secondary}} = c$, and:

$$s_{d,a} = a^{\alpha} c \left[ e^{-d/\tau} + \frac{\tau\,(1 - e^{-d/\tau})}{d} \right] \tag{4}$$

Fig. 6. Data from an experiment by Green and associates (2004) in which the amount delivered to pigeons immediately (1/2 s delay) was adjusted to indifference with that given after the delay noted on the x axis. The parameter is the magnitude of the delayed reinforcer. The curves are drawn by Eqs. (5) and (6).
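A sketch of the combined strength of Eq. (4), taken here as the power-function utility of amount multiplied by the sum of the exponential (primary) and averaged-exponential (secondary) terms. The function name and default parameters are assumptions for illustration; at zero delay the secondary term is given its limiting value of 1.

```python
import math

def associative_strength(d, a, alpha=1.0, c=1.0, tau=2.0):
    """a**alpha * c * (primary + secondary) for delay d and amount a."""
    if d < 0 or a < 0 or tau <= 0:
        raise ValueError("need d >= 0, a >= 0 and tau > 0")
    primary = math.exp(-d / tau)
    # Averaged trace; its limit as d -> 0 is 1 (times c, applied below).
    secondary = 1.0 if d == 0 else tau * (1.0 - math.exp(-d / tau)) / d
    return (a ** alpha) * c * (primary + secondary)

if __name__ == "__main__":
    for d in (0.5, 2.0, 8.0):
        print(d, associative_strength(d, a=1.0), associative_strength(d, a=3.0))
```

Because amount enters only through the factor `a ** alpha`, the ratio of the strengths of two amounts at a common delay does not depend on that delay.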
3.1 Methods of adjustment
Psychophysical paradigms in which variables are adjusted to cause indifference in preferences or other judgments – “Matching paradigms” (Farell and Pelli, 1999) – are more secure of interpretation than those involving a psychological scale, such as one of value (Hand, 2004; Uttal, 2000). Their units are physical measurements, and they refer to a unique psychological point, that of equivalence. This may be determined whether the underlying scale is interval, ordinal, or even nominal.
How great must an amount $a_1$ be to balance a different amount $a_2$ at a different delay? Set

$$a_1^{\alpha} c \left[ e^{-d_1/\tau} + \frac{\tau\,(1 - e^{-d_1/\tau})}{d_1} \right] = a_2^{\alpha} c \left[ e^{-d_2/\tau} + \frac{\tau\,(1 - e^{-d_2/\tau})}{d_2} \right]$$

and solve for $a_1$:

$$\frac{a_1}{a_2} = \left[ \frac{e^{-d_2/\tau} + \tau\,(1 - e^{-d_2/\tau})/d_2}{e^{-d_1/\tau} + \tau\,(1 - e^{-d_1/\tau})/d_1} \right]^{1/\alpha} \tag{5}$$

Eq. (5) gives the relative equivalent value of amount $a_2$ delayed $d_2$, compared to an alternative delayed $d_1$. Typically, $d_1$ is “immediate” – that is, around 1/2 s, and then Eq. (5) gives the relative immediate equivalent amount. With $d_2 > d_1$, this ratio will be less than 1, indicating that a smaller immediate amount, relative to $a_2$, suffices to balance the latter at a remove of $d_2$. Note that neither amount appears in the right hand side; no magnitude effect is predicted: as long as the ratio of delays is the same, the predictions are the same when both amounts are multiplied by a constant. In general, no magnitude effect is found in delay discounting experiments with non-human animals (Green et al., 2004; Ong and White, 2004). Fig. 6 shows the course of Eq. (5), with $\alpha = 1.26$ and $\tau = 2.12$ s, passing near the average data from four pigeons in an experiment where the amount delivered after 1/2 s was adjusted to maintain indifference between it and a larger amount (given by the parameter in the figure) delivered at a delay.
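The indifference ratio of Eq. (5) can be sketched as below. The defaults α = 1.26 and τ = 2.12 s are the values quoted in the text for Fig. 6; the function names and the helper `delay_factor` are hypothetical, and the bracketed delay term is assumed to be the exponential plus the averaged-exponential component.

```python
import math

def delay_factor(d, tau):
    """Primary plus secondary delay terms; tends to 2 as d -> 0."""
    primary = math.exp(-d / tau)
    secondary = 1.0 if d == 0 else tau * (1.0 - math.exp(-d / tau)) / d
    return primary + secondary

def equivalent_amount_ratio(d1, d2, alpha=1.26, tau=2.12):
    """a1 / a2 at indifference: amount at delay d1 that balances a2 at delay d2."""
    return (delay_factor(d2, tau) / delay_factor(d1, tau)) ** (1.0 / alpha)

if __name__ == "__main__":
    d1 = 0.5  # the "immediate" delay, in seconds
    for d2 in (1, 2, 4, 8, 16):
        print(d2, round(equivalent_amount_ratio(d1, d2), 3))
```

The immediate equivalent of any delayed amount `a2` is then `a2 * equivalent_amount_ratio(d1, d2)`; since the ratio does not involve the amounts, doubling `a2` simply doubles the equivalent, which is the absence of a magnitude effect noted above.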
The primary and conditioned reinforcing effects are highly correlated; Eq. (5) may be simplified by deleting the primary influence of the reinforcers on the choice responses, to yield:

$$\frac{a_1}{a_2} = \left[ \frac{d_1\,(1 - e^{-d_2/\tau})}{d_2\,(1 - e^{-d_1/\tau})} \right]^{1/\alpha} \tag{6}$$

which draws the continuous curve through the data in Fig. 6. But the primary and secondary effects may be dissociated, and when they are, alternatives with both are preferred to those with just primary reinforcement (Marcattilio and Richards, 1981; Lattal, 1984). The hyperbolic approximation to Eq. (6) provides a decent fit to these data as well, but falls noticeably farther from their average than do Eqs. (5) and (6).

Fig. 7. Data from experiment 2 of Fox et al. (2008) studying relative choice of 3 pellets delayed vs 1 immediate in two strains of rats, with curves drawn by Eq. (8). Here “immediate” is set at 1/2 s.
In some matching experiments, the delay to one outcome is adjusted, rather than the amount. Eq. (6) yields no simple prediction, but invoking the series expansion of the exponential term² that was used in going from Eq. (2) to Eq. (3):

$$\bar{s}_{d_i,a_i} \approx \frac{a_i^{\alpha}\,c}{1 + d_i/\tau},$$

leads to the simple linear relation of Eq. (7):

$$d_2 = d_1 \left(\frac{a_2}{a_1}\right)^{\alpha} + \tau \left[ \left(\frac{a_2}{a_1}\right)^{\alpha} - 1 \right] \tag{7}$$
Operations that increase the sensitivity to reinforcement (increase $\alpha$) or flatten the gradient (increase $\tau$) will increase the indifference point, $d_2$. The provenance of the effect can be determined by manipulating $d_1$, as the former will increase both slope and intercept, and the latter only intercept. Some drugs, such as stimulants, may decrease $\alpha$ while increasing $\tau$ (Maguire et al., 2009; Pitts and Febbo, 2004), and their results will thus vary as a function of the balance between the two, largely determined by the value of $d_1$. A linear equation such as (7), based on multiplicative hyperbolic functions of amount and delay, was proposed and validated by Mazur (2001), and independently by Bradshaw's group (Ho et al., 1999; Bezzina et al., 2007; da Costa Araújo et al., 2009). In Bradshaw's model, as in Eq. (7), the slope depends on relative payoffs regulated by the amount amplifier parameter $\alpha$, and the intercept on a multiplicative function of that and delay sensitivity. Their model has also been applied to human delay discounting (Hinvest and Anderson, 2010; Liang et al., 2010).
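The linear relation between the sooner and later indifference delays can be illustrated by equating two hyperbolic strengths of the form a^α·c/(1 + d/τ) and solving for the later delay, which gives d2 = d1·k + τ·(k − 1) with k = (a2/a1)^α. This is a sketch under that derivation; the function names and the parameter values in the usage lines are assumptions.

```python
def indifference_delay(d1, a1, a2, alpha=1.0, tau=2.0):
    """Delay d2 at which amount a2 balances amount a1 delayed by d1."""
    k = (a2 / a1) ** alpha          # slope of the linear relation
    return d1 * k + tau * (k - 1.0)  # intercept is tau * (k - 1)

def hyperbolic_strength(d, a, alpha=1.0, c=1.0, tau=2.0):
    """Hyperbolic approximation to the strength of amount a at delay d."""
    return (a ** alpha) * c / (1.0 + d / tau)

if __name__ == "__main__":
    for d1 in (0.5, 2.0, 4.0):
        d2 = indifference_delay(d1, a1=1.0, a2=3.0, alpha=1.26, tau=2.12)
        print(d1, round(d2, 2))
```

In this form a change in `alpha` alters both the slope `k` and the intercept `tau * (k - 1)`, whereas a change in `tau` alters the intercept alone, which is the dissociation described above.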
3.2 Methods of forced choice
An alternative psychophysical procedure involves the measurement of the degree of preference between two fixed alternatives, or the frequency of choosing one over the other. Eq. (4) may be rearranged to predict the outcome of choice experiments in which the delays and outcomes are invariant. The relative associative strength of the alternatives is:

$$\frac{s_{d_1,a_1}}{s_{d_1,a_1} + s_{d_2,a_2}} = \left[ 1 + \left(\frac{a_2}{a_1}\right)^{\alpha} \frac{e^{-d_2/\tau} + \tau\,(1 - e^{-d_2/\tau})/d_2}{e^{-d_1/\tau} + \tau\,(1 - e^{-d_1/\tau})/d_1} \right]^{-1} \tag{8}$$
In the case of unbiased choice there are two free parameters, the rate of diminishing marginal utility for larger amounts, $\alpha$, and the time constant of the memory trace, $\tau$. Note that amounts again appear as a ratio, indicating scale invariance: there is no magnitude effect. Fig. 7 shows this model follows a path similar to the data of Fox et al. (2008), who asked whether rat models of ADHD (SHRs) would show steeper delay gradients than control (WKY) rats. They did. Other investigators (Adriani et al., 2003) did not find steeper gradients for SHR, but observed very large individual differences. As noted by Orduña et al. (2007), the main effect found by Fox and associates may be due to idiosyncrasies of their control rats (Sagvolden et al., 2009).
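The relative-strength form of Eq. (8) can be sketched as follows, again assuming that each alternative's delay term is the exponential plus the averaged-exponential component and that choice is unbiased. Function names and the example parameter values are hypothetical.

```python
import math

def delay_factor(d, tau):
    """Primary plus secondary delay terms; tends to 2 as d -> 0."""
    primary = math.exp(-d / tau)
    secondary = 1.0 if d == 0 else tau * (1.0 - math.exp(-d / tau)) / d
    return primary + secondary

def choice_proportion(d1, a1, d2, a2, alpha=1.0, tau=2.0):
    """Predicted proportion of choices for alternative 1 (amount a1 at delay d1)."""
    ratio = (a2 / a1) ** alpha * delay_factor(d2, tau) / delay_factor(d1, tau)
    return 1.0 / (1.0 + ratio)

if __name__ == "__main__":
    # 1 pellet after 1/2 s versus 3 pellets after increasing delays.
    for d2 in (1, 2, 4, 8, 16, 32):
        print(d2, round(choice_proportion(0.5, 1.0, d2, 3.0), 3))
```

Multiplying both amounts by the same constant leaves the proportion unchanged, which is the scale invariance noted in the text.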
4 Discussion
Prospective judgments of equivalent amounts by humans, typical in the delay-discounting literature, require computations that are different in kind from those of paradigms in which real delays are conditioned to discriminative stimuli. Humans can be instructed to contemplate the desirability of ten thousand dollars in ten years, and to stipulate how little they would settle for one week hence in lieu of it. The performance entails a scale of future time, the value of an outcome deferred by that delay, and concatenation of the non-linear time-scale with a non-linear amount scale, from which a variety of results are imaginable (Killeen, 2009; Rachlin, 2006). Little wonder that there are differences in covering models. The only way to so instruct other animals is to expose them to such realities repeatedly. The assertion in the opening of this paper that the future cannot act on non-verbal animals was meant to emphasize this difference: on the one hand verbally presented unexperienced hypotheticals that can control human responses, and on the other the conditioning of behavior reinforced by the presentation of conditioned reinforcers signaling real, experienced, delays, that controls pigeon and rat behavior.
This paper should be read as a grounding of hyperbolic models of delay discounting, not a critique of them. It presented a few ideas. First, it is observed that Fig. 1 is not a model of a process. It is a summary of some other kind of process, such as the one proposed in Fig. 3. The distinction is important, as thinking of Fig. 1 as a process can be misleading. I am not alone in this concern:
In this [Fig. 1] view, reinforcers reach back in time to effect this response in the presence of the remembered stimulus. As a model of how an animal adapts to, or learns about, situations with stimulus–behavior delays and response–reinforcer delays, the model has the problem of reinforcer effects spreading backward in time. Physiologically, the process cannot act in this way, and physiology must require that the memory of an event flows forward in time, rather than the reinforcer effect flowing backwards. But the response-centric view is the dominant view in the study of delayed reinforcers and of self control.
A simpler, much more likely, and physiologically consistent conceptualization of the adaptation to these delays is shown in [Fig. 2]. In this view, at the point at which a reinforcer is delivered, it is the conjunction of the memories of both the stimulus and the response at the time of reinforcer delivery that is “strengthened” and, I presume, remembered and subsequently accessed and used. This approach suggests a different, and more parsimonious, mechanism for learning and activity that is squarely based on memory. When reinforcers are delayed, it is the residual memory of responses times the value of the reinforcers that will describe the effects of reinforcer delay on behavior. When responses are delayed following stimuli, it is the residual memory of the stimulus times the value of the reinforcer that will describe the stimulus–reinforcer conjunction, providing a role for stimulus–reinforcer relations (as in momentum theory) (Davison, 2006).
The present paper constitutes simply the endorsement of the first paragraph and one realization of the second paragraph.
colleague felt that such grounding is unnecessary, as the hyperbola is justified by its ubiquitous accuracy in characterizing the ‘discounting’ data, that the rationale supporting hyperbolic discounting does not rely on the validity or even plausibility of any internal mechanism. Rather that it relies on its predictive ability on its own level, the overt behavior of the whole organism, and its applicability in the real world. So, why all the above talk about associations, decaying traces, and assignment of credit? Because, I plead, it puts some meat on the bones, holds out a hand of translation to AI reinforcement theorists, and turns ‘round a figure got backward. But, chacun à son goût.
A third idea expressed in this paper is that simple processes of decay (Eq. (1)) and average decay (Eq. (2)) represent behavioral processes that are void of cognitive representations. That is not the case for human delay discounting, as the vast majority (though not all) of the data from it involve hypothetical amounts and delays that are communicated verbally, and have never and will never be experienced by the individual. The present treatment is thoroughly behavioral. The use of mathematics to represent the conditioning processes has been misunderstood by some colleagues as asserting that the animals must perform such computations. That is true in the same sense that a rope suspended at two points evaluates a catenary equation. The calculations of pigeon and rope, such as they are, are embodied, not computed; the mathematical representation derives from the scientist, not from the thing he or she uses it to describe.
The final idea is the importance of the distinction between different mensuration paradigms. The matching paradigm, some of whose results are displayed in Fig. 6, is different in kind than the forced-choice/preference paradigm, some of whose results are displayed in Fig. 7. Why should an animal who prefers alternative A to alternative B not always choose A; but rather choose it, say, only 70% of the time? It does not suffice to say “because it matches”, which offers a result in the guise of an explanation. To decline the thing you prefer, you must have balancing considerations, such as cost, or novelty; or be confused; or be irrational. Mazur (2010) has shown that in the simple forced choice paradigm non-exclusive preference may be due to experimental designs that confuse the animal. That possibility is exacerbated in the concurrent chain version of the forced-choice paradigm. Sub-exclusive preference there occurs not because the other 30% of the time the animal prefers B (how often would you choose $30 over $70, once the pleasure of thwarting the experimenter has paled?) – but because the contingencies of reinforcement have made the probability of getting B sufficiently greater at that point in time, primed and awaiting collection, with the preferred A never any closer.³
The way in which probabilities on concurrent schedules bend
preference from rational exclusivity toward matching was nicely
demonstrated by Crowley and Donahoe (2004). But these
evolving probabilities are typically treated as externals, measured (e.g.,
Boutros et al., 2009; Davison and Baum, 2003), analysed (MacDonall,
2000, 2005), and modeled (e.g., Grace et al., 2006) in their own right.
Unfortunately, that research seldom changes the interpretation of
relative rates as prima facie measures of preference. The
dynamically evolving probabilities that concurrent VIs schedule are an
intrinsic part of the package the animal must dynamically balance,
not a neutral tool to measure it. When the negative feedback
inherent in those schedules is eliminated in adjustment paradigms
where confusion is minimized, animals just about always choose
³ On random-interval VI schedules with mean $m$, the probability of reinforcement
on the same key one second after the last peck is always $1/m$, whereas on the other
it increases toward 1 as $1 - e^{-t/m}$, with $t$ the time since the last changeover.
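As an illustrative aside to footnote 3, the two probabilities it describes can be tabulated in a few lines of Python. This is only a sketch under stated assumptions: the function names `p_same_key` and `p_other_key` and the 60-s mean are hypothetical choices made here, not taken from the text, and reinforcers are assumed to be set up at a constant rate of 1/m per second and held until collected.

```python
import math


def p_same_key(m: float) -> float:
    """Approximate chance of reinforcement on the current key one second
    after the last peck, for a random-interval schedule with mean m seconds
    (set-ups assumed to occur at a constant rate of 1/m per second)."""
    return 1.0 / m


def p_other_key(t: float, m: float) -> float:
    """Chance that the other key holds a primed reinforcer t seconds after
    the last changeover, assuming set-ups are held until collected."""
    return 1.0 - math.exp(-t / m)


if __name__ == "__main__":
    m = 60.0  # hypothetical mean interval, in seconds
    for t in (1, 5, 15, 30, 60, 120):
        print(f"t={t:>3} s  same key: {p_same_key(m):.3f}  "
              f"other key: {p_other_key(t, m):.3f}")
```

For this hypothetical 60-s schedule, the probability on the unattended key overtakes the constant 1/m on the attended key within a few seconds and keeps growing, which is the negative feedback referred to in the text.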
Magnitude, delay, and probability of reinforcement interact to control choice in concurrent schedules (Elliffe et al., 2008). Some interaction is allowed by Eq. (8) due to its many nonlinearities, giving more weight to delay differentials as both delays increase. But Ito and Asaki (1982) found substantial monotonic increases in rats’ preference for 3 vs. 1 pellets as the equal delays to their receipt increased. Ong and White (2004) noted other instances of this effect, and attributed it to increased sensitivity to reinforcer amount when reinforcers are delayed. But it is not clear how that is anything other than a magnitude effect, and thus at odds with the results from matching (adjustment) paradigms.
Whether due to discrimination failure in simple forced choice, or negative feedback contingencies in concurrent-chain schedules, non-exclusive preferences are an uncertain metric of what animals value. The application of Eqs. (5)–(7) for matching paradigms is therefore offered with more confidence than Eq. (8) for concurrent-chain interval schedules, which require a more complex model, such as that of Christensen and Grace (2010).
Acknowledgements
I thank Tim Cheung and Ryan Brackney for comments, Robert Kessel for insisting on mathematical precision, Tony Nevin for insisting on conceptual clarity as well, and all of them for helping to show me how to achieve those desiderata. The remaining significant deviations are mine.
References
Adriani, W., Caprioli, A., Granstrem, O., Carli, M., Laviola, G., 2003. The spontaneously hypertensive-rat as an animal model of ADHD: evidence for impulsive and non-impulsive subpopulations. Neurosci. Biobehav. Rev. 27, 639–651.
Baum, W.M., 1979. Matching, undermatching, and overmatching in studies of choice. J. Exp. Anal. Behav. 32, 269–281.
Baum, W.M., 2005. Understanding Behaviorism: Behavior, Culture, and Evolution. Blackwell, Malden, MA, p. 312.
Bezzina, G., Cheung, T.H.C., Asgari, K., Hampson, C.L., Body, S., Bradshaw, C.M., Szabadi, E., Deakin, J.F.W., Anderson, I.M., 2007. Effects of quinolinic acid-induced lesions of the nucleus accumbens core on inter-temporal choice: a quantitative analysis. Psychopharmacology 195, 71–84.
Boutros, N., Elliffe, D., Davison, M., 2009. Time versus response indices affect conclusions about preference pulses. Behav. Processes 84, 450–454.
Bower, G.H., 1994. A turning point in mathematical learning theory. Psychol. Rev. 101, 290–300.
Brannon, E.M., Wusthoff, C.J., Gallistel, C.R., Gibbon, J., 2001. Numerical subtraction in the pigeon: evidence for a linear subjective number scale. Psychol. Sci. 12, 238–243.
Christensen, D.R., Grace, R.C., 2010. A decision model for steady-state choice in concurrent chains. J. Exp. Anal. Behav. 94, 227–240.
Crowley, M.A., Donahoe, J.W., 2004. Matching: its acquisition and generalization. J. Exp. Anal. Behav. 82, 143–159.
da Costa Araújo, S., Body, S., Hampson, C.L., Langley, R.W., Deakin, J.F.W., Anderson, I.M., Bradshaw, C.M., Szabadi, E., 2009. Effects of lesions of the nucleus accumbens core on inter-temporal choice: further observations with an adjusting-delay procedure. Behav. Brain Res. 202, 272–277.
Davison, M., 2006. Behavior-centric versus reinforcer-centric descriptions of behavior. PsyCrit 12 (November), 1–3.
Davison, M., Baum, W.M., 2003. Every reinforcer counts: reinforcer magnitude and local preference. J. Exp. Anal. Behav. 80, 95–129.
Elliffe, D., Davison, M., Landon, J., 2008. Relative reinforcer rates and magnitudes do not control concurrent choice independently. J. Exp. Anal. Behav. 90, 169–185.
Escobar, R., Bruner, C.A., 2007. Response induction during the acquisition and maintenance of lever pressing with delayed reinforcement. J. Exp. Anal. Behav. 88, 29–49.
Estes, W.K., 1950. Toward a statistical theory of learning. Psychol. Rev. 57, 94–107.
Estes, W.K., Suppes, P., 1974. Foundations of stimulus sampling theory. In: Contemporary Developments in Mathematical Psychology.
Farell, B., Pelli, D.G., 1999. Psychophysical methods, or how to measure a threshold and why. In: Carpenter, R.H.S., Robson, J.G. (Eds.), Vision Research: A Practical Guide to Laboratory Methods. Oxford Univ. Press, New York.
Fox, A.T., Hand, D.J., Reilly, M.P., 2008. Impulsive choice in a rodent model of attention-deficit/hyperactivity disorder. Behav. Brain Res. 187, 146–152.
Grace, R.C., Berg, M.E., Kyonka, E.G.E., 2006. Choice and timing in concurrent chains: effects of initial-link duration. Behav. Processes 71, 188–200.
Green, L., Myerson, J., 2004. A discounting framework for choice with delayed and probabilistic rewards. Psychol. Bull. 130, 769–792.
Green, L., Myerson, J., Holt, D.D., Slevin, J.R., Estle, S.J., 2004. Discounting of delayed food rewards in pigeons and rats: is there a magnitude effect? J. Exp. Anal. Behav. 81, 39–50.
Hand, D.J., 2004. Measurement Theory and Practice. Oxford University Press, Inc., New York, p. 320.
Hinvest, N.S., Anderson, I.M., 2010. The effects of real versus hypothetical reward on delay and probability discounting. Q. J. Exp. Psychol. 63, 1072–1084.
Ho, M.Y., Mobini, S., Chiang, T.J., Bradshaw, C.M., Szabadi, E., 1999. Theory and method in the quantitative analysis of “impulsive choice” behaviour: implications for psychopharmacology. Psychopharmacology 146, 362–372.
Horney, J., Fantino, E., 1984. Choice for conditioned reinforcers in the signaled absence of primary reinforcement. J. Exp. Anal. Behav. 41, 193–201.
Ito, M., Asaki, K., 1982. Choice behavior of rats in a concurrent-chains schedule: amount and delay of reinforcement. J. Exp. Anal. Behav. 37, 383–392.
Jenkins, H.M., Boakes, R.A., 1973. Observing stimulus sources that signal food or no food. J. Exp. Anal. Behav. 20, 197–207.
Johansen, E.B., Killeen, P.R., Russell, V.A., Tripp, G., Wickens, J.R., Tannock, R., Williams, J., Sagvolden, T., 2009. Origins of altered reinforcement effects in ADHD. Behav. Brain Funct. 5, 7.
Killeen, P.R., 1972. The matching law. J. Exp. Anal. Behav. 17, 489–495.
Killeen, P.R., 1982a. Incentive theory. In: Bernstein, D.J. (Ed.), Nebraska Symposium on Motivation, vol. 1981. Response Structure and Organization, University of Nebraska Press, Lincoln.
Killeen, P.R., 1982b. Incentive theory II: models for choice. J. Exp. Anal. Behav. 38, 217–232.
Killeen, P.R., 1985. Incentive theory IV: magnitude of reward. J. Exp. Anal. Behav. 43, 407–417.
Killeen, P.R., 1994. Mathematical principles of reinforcement. Behav. Brain Sci. 17, 105–172.
Killeen, P.R., 2001a. Modeling games from the 20th century. Behav. Processes 54, 33–52.
Killeen, P.R., 2001b. Writing and overwriting short-term memory. Psychon. Bull. Rev. 8, 18–43.
Killeen, P.R., 2005. Gradus ad parnassum: ascending strength gradients or descending memory traces? Behav. Brain Sci. 28, 432–434.
Killeen, P.R., 2009. An additive-utility model of delay discounting. Psychol. Rev. 116, 602–619.
Killeen, P.R., Bizo, L.A., 1998. The mechanics of reinforcement. Psychon. Bull. Rev., 221–238.
Killeen, P.R., Fantino, E., 1990. A unified theory of choice. J. Exp. Anal. Behav. 53, 189–200.
Lattal, K.A., 1984. Signal functions in delayed reinforcement. J. Exp. Anal. Behav. 42, 239–253.
Lattal, K.A., 2010. Delayed reinforcement of operant behavior. J. Exp. Anal. Behav. 93, 129–139.
Liang, C.H., Ho, M.Y., Yang, Y.Y., Tsai, C.T., 2010. Testing the applicability of a multiplicative hyperbolic model of inter-temporal and risky choice in human volunteers. Chin. J. Psychol. 52, 189–204.
Lieberman, D.A., McIntosh, D.C., Thomas, G.V., 1979. Learning when reward is delayed: a marking hypothesis. J. Exp. Psychol. Anim. Behav. Process. 5, 224–242.
MacDonall, J.S., 2000. Synthesizing concurrent interval performances. J. Exp. Anal. Behav. 74, 189–206.
MacDonall, J.S., 2005. Earning and obtaining reinforcers under concurrent interval scheduling. J. Exp. Anal. Behav. 84, 167–183.
Maguire, D.R., Rodewald, A.M., Hughes, C.E., Pitts, R.C., 2009. Rapid acquisition of preference in concurrent schedules: effects of D-amphetamine on sensitivity to reinforcement amount. Behav. Processes 81, 238–243.
Marcattilio, A.J.M., Richards, R.W., 1981. Preference for signaled versus unsignaled reinforcement delay in concurrent-chain schedules. J. Exp. Anal. Behav. 36, 221–229.
Mazur, J.E., 2001. Hyperbolic value addition and general models of animal choice. Psychol. Rev. 108, 96–112.
Mazur, J.E., 2010. Distributed versus exclusive preference in discrete-trial choice. J. Exp. Psychol. Anim. Behav. Process. 36, 321–333.
Miczek, K.A., Grossman, S.P., 1971. Positive conditioned suppression: effects of CS duration. J. Exp. Anal. Behav. 15, 243–247.
Moore, J., Fantino, E., 1975. Choice and response contingencies. J. Exp. Anal. Behav. 23, 339–347.
Neimark, E.D., Estes, W.K., 1967. Stimulus Sampling Theory. Holden-Day, San Francisco.
Nickerson, R.S., 2009. Mathematical Reasoning: Patterns, Problems, Conjectures, and Proofs. Psychology Press, London.
Niv, Y., Joel, D., Meilijson, I., Ruppin, E., 2002. Evolution of reinforcement learning in uncertain environments: a simple explanation for complex foraging behaviors. Adapt. Behav. 10, 5–24.
Ong, E.L., White, K.G., 2004. Amount-dependent temporal discounting? Behav. Processes 66, 201–212.
Orduña, V., Hong, E., Bouzas, A., 2007. Interval bisection in spontaneously hypertensive rats. Behav. Processes 74, 107–111.
Pearce, J.M., Hall, G., 1978. Overshadowing the instrumental conditioning of a lever-press response by a more valid predictor of the reinforcer. J. Exp. Psychol. Anim. Behav. Process. 4, 356–367.
Pitts, R.C., Febbo, S.M., 2004. Quantitative analyses of methamphetamine’s effects on self-control choices: implications for elucidating behavioral mechanisms of drug action. Behav. Processes 66, 213–233.
Rachlin, H., 1971. On the tautology of the matching law. J. Exp. Anal. Behav. 15, 249–251.
Rachlin, H., 1992. Diminishing marginal value as delay discounting. J. Exp. Anal. Behav. 57, 407–415.
Rachlin, H., 2006. Notes on discounting. J. Exp. Anal. Behav. 85, 425–435.
Reed, P., Doughty, A.H., 2005. Within-subject testing of the signaled-reinforcement effect on operant responding as measured by response rate and resistance to change. J. Exp. Anal. Behav. 83, 31–45.
Reilly, M.P., Posadas-Sanchez, D., Kettle, L.C., Killeen, P.R., 2011. Making the trip worthwhile: do rats (Rattus norvegicus) and pigeons (Columba livia) forage prospectively? Behav. Processes, in review.
Richards, R.W., 1981. A comparison of signaled and unsignaled delay of reinforcement. J. Exp. Anal. Behav. 35, 145–152.
Roberts, W.A., Feeney, M.C., 2009. The comparative study of mental time travel. Trends Cogn. Sci. 13, 271–277.
Roberts, W.A., Feeney, M.C., McMillan, N., MacPherson, K., Musolino, E., Petter, M., 2009. Do pigeons (Columba livia) study for a test? J. Exp. Psychol. Anim. Behav. Process. 35, 129–142.
Sagvolden, T., Johansen, E.B., Wøien, G., Walaas, S.I., Storm-Mathisen, J., Bergersen, L.H., Hvalby, Ø., Jensen, V., Aase, H., Russell, V.A., Killeen, P.R., DasBanerjee, T., Middleton, F.A., Faraone, S.V., 2009. The spontaneously hypertensive rat model of ADHD – the importance of selecting the appropriate reference strain. Neuropharmacology 57, 619–626.
Sanabria, F., Killeen, P.R., 2007. Temporal generalization accounts for response resurgence in the peak procedure. Behav. Processes 74, 126–141.
Schaal, D.W., Branch, M.N., 1988. Responding of pigeons under variable-interval schedules of unsignaled, briefly signaled, and completely signaled delays to reinforcement. J. Exp. Anal. Behav. 50, 33–54.
Schaal, D.W., Branch, M.N., 1990. Responding of pigeons under variable-interval schedules of signaled-delayed reinforcement: effects of delay-signal duration. J. Exp. Anal. Behav. 53, 103–121.
Schwartz, B., 1976. Positive and negative conditioned suppression in the pigeon: effects of the locus and modality of the CS. Learn. Motiv. 7, 86–100.
Shettleworth, S.J., Plowright, C., 1989. Time horizons of pigeons on a two-armed bandit. Anim. Behav. 37, 610–623.
Singh, S.P., Sutton, R.S., 1996. Reinforcement learning with replacing eligibility traces. Mach. Learn. 22, 123–158.
Sutton, R.S., Barto, A.G., 1990. Time-derivative models of Pavlovian reinforcement. In: Gabriel, M., Moore, J. (Eds.), Learning and Computational Neuroscience: Foundations of Adaptive Networks. MIT Press, Cambridge, MA.
Thomas, G.V., Lieberman, D.A., McIntosh, D.C., Ronaldson, P., 1983. The role of marking when reward is delayed. J. Exp. Psychol. Anim. Behav. Process. 9, 401–411.
Timberlake, W., Lucas, G.A., 1990. Behavior systems and learning: from misbehavior to general principles. In: Klein, S.B., Mowrer, R.R. (Eds.), Contemporary Learning Theories: Instrumental Conditioning Theory and the Impact of Constraints on Learning. Erlbaum, Hillsdale, NJ.
Uttal, W.R., 2000. The War Between Mentalism and Behaviorism: On the Accessibility of Mental Processes. Lawrence Erlbaum Associates, Inc., Mahwah, NJ.
Uttal, W.R., 2008. Time, Space, and Number in Physics and Psychology. Sloan Publishing, Cornwall-on-Hudson, NY.
Wilkenfield, J., Nickel, M., Blakely, E., Poling, A., 1992. Acquisition of lever-press responding in rats with delayed reinforcement: a comparison of three procedures. J. Exp. Anal. Behav. 58, 431–443.
Williams, B.A., 1999. Associative competition in operant conditioning: blocking the response–reinforcer association. Psychon. Bull. Rev. 6, 618–623.
Williams, B.A., Dunn, R., 1991. Preference for conditioned reinforcement. J. Exp. Anal. Behav. 55, 37–46.