Running head: DISCOUNTING
Models of trace decay, eligibility for reinforcement, and delay of reinforcement
gradients, from exponential to hyperboloid
Peter R. Killeen
Arizona State University
Prepublication preprint
Killeen, P. R. (2011). Models of trace decay, eligibility for reinforcement, and delay of reinforcement gradients, from exponential to hyperboloid. Behavioural Processes, 87, 57–63. DOI: 10.1016/j.beproc.2010.12.016
Abstract
Behavior such as depression of a lever or perception of a stimulus may be strengthened by consequent behaviorally significant events (BSEs), such as reinforcers. This is the Law of Effect. As time passes since its emission, the ability of the behavior to be reinforced decreases. This is trace decay. It is upon decayed traces that subsequent BSEs operate. If the trace comes from a response, it constitutes primary reinforcement; if from perception of an extended stimulus, it is classical conditioning. This paper develops simple models of these processes. It premises exponentially decaying traces related to the richness of the environment, and conditioned reinforcement as the average of such traces over the extended stimulus, yielding an almost-hyperbolic function of duration. The models account for some data, and reinforce the theories of other analysts by providing a sufficient account of the provenance of these effects. They lead to a linear relation between sooner and later isopreference delays whose slope depends on sensitivity to reinforcement, and whose intercept depends on that and the steepness of the delay gradient. Unlike human prospective judgments, all control is vested in either primary or secondary reinforcement processes; therefore the use of the term discounting, appropriate for humans, may be less descriptive of the behavior of nonverbal organisms.
Keywords: Delay of reinforcement gradients, discounting, forced choice paradigms, magnitude
effect, matching paradigms, reinforcement learning, trace decay gradients
Pigeons cannot reliably count above 3 (Brannon et al., 2001; Nickerson, 2009; Uttal, 2008), have short time-horizons (Shettleworth and Plowright, 1989), may be stuck in time (Roberts and Feeney, 2009), do not ask for the answers to the questions they are about to be asked (Roberts et al., 2009), and fail to negotiate an amount of reinforcement commensurate with the work that they are about to undertake (Reilly et al., 2011). How do such simple creatures discount future payoffs as a function of their delay? It is the thesis of this paper that they do not: that the orderly data in such studies are the simple result of the dilution of the conditioned reinforcers which support and guide that choice, as a function of the delay to the outcome that they signal.
Classic and generally accepted concepts of causality preclude events from acting backward in time. Then what sense do we make of Fig. 1, a familiar rendition of the control exerted by delayed reinforcers? How do the animals know what's coming? Only three accounts come to mind: (a) Precognition. But causality rules that out. (b) It is memory of a past choice that makes contact with reinforcement; the figure should be reversed. Or, (c) the animals have learned what leads to what. There follows an extended argument that (b) and (c) are both true, and that in novel contexts, (b) typically leads to (c).
>> Please place Fig 1 hereabouts <<

When in the course of an animal's behavior a behaviorally significant event (BSE; or phylogenetically important event (Baum, 2005); or, more familiarly, incentive, reinforcer, or unconditioned stimulus) occurs, there immediately arises the question of whence. In computer science this is the assignment-of-credit problem. If the organism, or software, takes into account events in the last instant, there are r potential causes for the BSE, where r is a measure of the richness of context. An additional r events occurred in the prior, penultimate instant. The combination of any one of these with those in the ultimate instant could have been the causal chain that led to the BSE: r^2 sequences in toto. Extending the account further, to the antepenultimate instant, raises the pool to r^3. Continue this process back and the candidate pool of sequences grows as r^n, where n is the depth of query. If each of these instants of apprehension lasts δ s, then n = d/δ, and the candidate pool grows as r^(d/δ), where d is the delay between event and consequence. In the continuous limit, this equals e^(d/τ), where τ is the time constant of the traces—the inverse of the continuous limit of the richness parameter r. This means that the gradients get steeper in rich environments: τ = 1/r. It follows that any one causal path is eligible for e^(−d/τ) of the credit for reinforcement, everything being equal. Of course everything is not equal: The priors on some events are higher than on others, either because of their phylogenetic relevance, or their memorability, which may be enhanced by marking their occurrence with salient stimuli. Allowing for such bias, represented by the parameter c, we would expect the causal impact to decrease with time prior as:

    s′_d = c e^(−d/τ),    (1)

which is the point association¹ of an event at d seconds' remove from the BSE, as seen in Fig. 2.
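The decay posited in Eq. 1 is easy to sketch numerically. The following is a minimal illustration, not code from the paper; the function name and parameter values are my own, with tau = 1/r as in the text.

```python
import math

def point_association(d, c=1.0, tau=5.0):
    """Eq. 1: residual eligibility of an event occurring d seconds before the BSE.

    c is the bias/salience parameter; tau = 1/r is the time constant,
    the inverse of the richness of the context.
    """
    return c * math.exp(-d / tau)

# Richer environments (smaller tau) yield steeper trace-decay gradients:
lean_context = [round(point_association(d, tau=10.0), 3) for d in (0, 5, 10)]
rich_context = [round(point_association(d, tau=2.0), 3) for d in (0, 5, 10)]
```

With tau = 10 the trace at 5 s retains about 61% of its initial value; with tau = 2, about 8%.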
>> Please place Fig 2 hereabouts <<

This story for why associability between an event and a subsequent BSE may decay exponentially, retold from Johansen et al. (2009) and Killeen (2005), has some empirical support (Killeen, 2001; Escobar and Bruner, 2007). Eligibility traces play a central role in AI reinforcement learning (Singh and Sutton, 1996). Classic models such as Sutton and Barto's posit a geometrically decreasing representation of events similar to that developed here, and work to reconcile details of instrumental and Pavlovian conditioning with various instantiations of such traces (Sutton and Barto, 1990; Niv et al., 2002). Alternatively, it is possible to simply posit exponential or hyperbolic decay of memory of the stimulus, and also that these traces may or may not vary with the richness of the environment. This has been the productive tactic of most analysts of delay discounting. If this disposition is good enough for you, skip the next 5 pages.

¹ If the duration of a response is δ s, then the impact of reinforcement on it is given by the integral of Eq. 1 from d to d + δ. For brief events such as responses, this essentially equals δ times the right-hand side of Eq. 1. For responses of similar durations this coefficient is absorbed by c.
What is the purported mechanism? As developed here it is one of stimulus competition, with richer environments and greater interludes providing more opportunities for interference. A stimulus-sampling model of acquisition (Bower, 1994; Estes and Suppes, 1974; Neimark and Estes, 1967; Estes, 1950) provides the basis of a model of acquisition in the face of such contingencies degraded by delay and distraction (Killeen, 2001). It is not repeated here. Another way to think of Eq. 1 is as a measure of the signal-to-noise ratio of a delay contingency. In the case c = 1/τ, Eq. 1 describes a probability distribution, so that identification of one point from the distribution reduces candidate uncertainty by log₂(eτ) bits.
What is the relation between eligibility traces and the delay of reinforcement gradient? Fig. 3 shows 7 trace gradients for events occurring more and more remote from the BSE. The most proximate occurs at the moment of reinforcement, and is visible only as a dot in the upper right corner; it receives the full credit for which it might be eligible. An event occurring 1 time step earlier has an impact diluted by about 30% by the time of reinforcement, as inferred from where its trace cuts the ordinate, the zero-delay axis at the right of the graph. Draw this measure of eligibility, 0.7, out 1 unit from the right frame, as shown by the arrow, and connect it to the full measure in the corner by a dashed line. The event 2 steps back decays by about 50% at the time of reinforcement; draw a line from there extending to the left at 2, and continue the dashed line to it. When bored of this construction, stop to consider the shape of the delay of reinforcement gradient—the dashed line. When smoothed, it will have exactly the same shape as any of the decay traces, but will be reflected about its new origin at 0.
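The construction just described can be mimicked in a few lines. This is an illustrative sketch, not code from the paper; the names and the value of τ are mine, chosen so that one step back dilutes the trace by about 30% and two steps by about 50%, as in Fig. 3.

```python
import math

TAU = 2.8  # illustrative time constant; exp(-1/2.8) is roughly 0.7

def trace(t_since_event, tau=TAU):
    """Exponential eligibility trace of a single event (Eq. 1 with c = 1)."""
    return math.exp(-t_since_event / tau)

# The delay-of-reinforcement gradient is read off where each trace cuts the
# ordinate: the residual eligibility, at the moment of reinforcement, of
# events emitted 0, 1, 2, ... steps earlier.
gradient = [trace(d) for d in range(8)]
```

The gradient is, point for point, the same exponential as any single trace, reflected about the moment of reinforcement: the product has the shape of the process.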
>> Please place Fig 3 hereabouts <<

The distinction between these two representations, one of process and the other of product, is important. As Fig. 2 makes clear, what is present at the time of reinforcement is a decayed trace of a response. Differential reinforcer magnitude can have no retroactive effect on the shape or elevation of those traces. Reinforcers of different magnitudes do not change the decay gradients, but rather act differentially on their tails: A larger reinforcer may be more effective at leveraging the same residual memory than a small one. But those tails may be of different elevation—and thus differentially able to receive the effect of the reinforcement—because they are more or less memorable (reflected in c) or because they occur in a richer or bleaker environment (reflected in τ).
2.0 Hyperbolic Dilemmas
How can gradients be exponential when everyone says that they are hyperbolic? The curves in Fig. 1 don't cross, whereas most representations of discounted future events of differing value do. These three figures address the associability of a discrete event at a remove of t from reinforcement. They do not address situations in which that event leads to an immediate change of state signaling a deferred outcome. A signal of change of state marks the precipitating event by immediately singling it out as the precursor of a better (or possibly worse) state of affairs.
Consider a response that causes the onset of a stimulus, and after a delay of d, a BSE. Assume that each of the temporal elements of the stimulus receives associations as given by Eq. 1, and that these are otherwise equivalent in time (that is, that the parameters of Eq. 1 don't change over the delay). In the case that one element of the stimulus is highly generalizable with the next, these associations add linearly, giving a total associability equal to ∫₀^d c e^(−t/τ) dt. This integral assumes that the temporal elements dt make linearly independent contributions to the total association. Because one element of the stimulus is, by hypothesis, indiscriminable from the next, any one element—in particular the one just following a response—has an average associability

    s_d = (cτ/d)(1 − e^(−d/τ)).    (2)

Eq. 2 is not discriminable from the inverse linear relation known as hyperbolic (Killeen, 2001). Fig. 4 demonstrates this similarity by fitting both Eq. 2, and the hyperbola

    s_d = c/(1 + d/(2τ)),    (3)

to data from Richards (1981) that describe the effects of signaled delayed reinforcement on the average response rates of four pigeons. The curves through the squares superimpose. This makes sense, as Eq. 3 is a series approximation² to Eq. 2.
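The near-identity of Eq. 2 and its hyperbolic approximation can be checked numerically. The sketch below is mine, assuming Eq. 2 is the average of the exponential trace over the delay and Eq. 3 its hyperbolic approximation; parameter values are arbitrary.

```python
import math

def s_average(d, c=1.0, tau=2.0):
    """Eq. 2: mean associability of a stimulus that spans the delay d."""
    if d == 0:
        return c
    return c * (tau / d) * (1.0 - math.exp(-d / tau))

def s_hyperbola(d, c=1.0, tau=2.0):
    """Eq. 3: the hyperbolic series approximation to Eq. 2."""
    return c / (1.0 + d / (2.0 * tau))

# The two stay close over moderate delays, diverging slowly at long ones,
# which is why the fitted curves superimpose in Fig. 4.
diffs = [abs(s_average(d) - s_hyperbola(d)) for d in (0.5, 1, 2, 4, 8)]
```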
Trang 8Experienced laboratory animals can tell the difference between the start of a long delay and
the start of a short one; they are sensitive to time and delay (Moore and Fantino, 1975) The use
of Eq 2 requires that, facing start of a long delay to food and a stimulus which—in the best of
times—is contiguous with food, control by the stimulus dominates that by time Animals, in
other words, are optimists: Their behavior is primarily under the control of the most hopeful
stimuli rather than some weighted average of predictive stimuli There is good evidence that this
is often the case (Horney and Fantino, 1984; Sanabria and Killeen, 2007; Jenkins and Boakes,
1973)
Also shown in Fig. 4 is the decay trace for unsignaled reinforcement. Under the hypothesis of the prior section, it is given by Eq. 1, an exponential function, shown as the continuous curve passing near the disks, which show response rates for unsignaled (non-resetting) delays. Also shown is the hyperbola, Eq. 3, which apparently gives an inferior fit to these data—although this data-base is too limited to make secure generalizations. For unsignaled delayed reinforcement, at least in this case, the exponential gradients are, as predicted, competitive with the more traditional hyperbolic gradients. Fig. 4 illustrates Lattal's generalization that "The unsignaled delay gradient … is characterized by [generally] lower response rates and a steeper slope than the gradient obtained with otherwise equivalent signaled delays" (Lattal, 2010).
>> Please place Fig 4 hereabouts <<
The average absolute deviation between Eqs. 2 and 3 over the range from 0.99 to 0.04 is 0.064; however, letting the time constant in either equation vary from its value in the other reduces this deviation to 0.023, within experimental error.
² The exponential term may also be approximated with the more standard Maclaurin series: e^(−d/τ) ≈ 1 − d/τ + (d/τ)²/2! − ⋯.
Whereas Fig. 4 usefully compares the effects of signaled and unsignaled delays, because the unsignaled delays were non-resetting, the actually experienced delays were variable and less, by an unspecified amount, than the abscissae. A better test of the sufficiency of Eq. 1 comes from Wilkenfield et al. (1992), using resetting delays, where the abscissae provide accurate representations of the experienced delays. These investigators reported the response rates during acquisition of lever pressing from four groups of rats, nine in each group. Their data from the first 100 minutes of acquisition are shown in Fig. 5. Again, the exponential provides a plausible model.
>> Please place Fig 5 hereabouts <<

The simple hyperbolic model has been shown adequate for most discount functions for non-verbal animals (Green and Myerson, 2004; Ong and White, 2004; Green et al., 2004). But unlike its cousin the hyperbola, which is ad hoc, Eq. 2 has some theoretical motivation: It predicts radical changes in preference as a function of the nature and continuity of the stimuli that bridge the delay between response and BSE, and holds out the promise for quantifying those effects. It is consistent with the important role of conditioned reinforcers in preference for delayed outcomes (Williams and Dunn, 1991), and provides a useful refinement to a unified theory of choice (Killeen and Fantino, 1990). In the latter theory, and its precedent (Killeen, 1982), the control by a delayed reinforcer was modeled as the sum of both the primary (i.e., point association with the response; Eq. 1) and secondary (i.e., terminal link cues; essentially Eq. 2) reinforcement effects. Equation 1, and a similar logic for the association of streams of responses with reinforcement, is the heart of the model of coupling in my theory of schedule effects, MPR (Killeen, 1994).
The presence of stimuli occurring between a response and BSE may not always be beneficial to conditioning the response. Brief stimuli occurring immediately after a response (marking it) may make the response more memorable when the BSE occurs (Lieberman, McIntosh, and Thomas, 1979; Thomas et al., 1983)—perhaps by increasing the value of c. Alternatively, such stimuli may initiate adjunctive behavior that serves as an extended conditioned stimulus (CS) (Schaal and Branch, 1988). Conversely, brief stimuli occurring just before reinforcement may block control by the response-reinforcer association (Pearce and Hall, 1978). Williams (1999) and Reed and Doughty (2005) demonstrated the power of both effects in the same experiments. Whether the effects of primary and secondary reinforcement add or interfere depends on the correlation of each of the contingencies with the behavior measured by the experimenter: A CS whose presentation is not contingent on behavior will only adventitiously strengthen the target response, and, depending on temporal variables, is as likely to compete with it; furthermore, one which signals non-contingent reinforcement will compete with concurrent instrumental responses (Miczek and Grossman, 1971). A CS presented on the instrumental operandum can enhance response rate, whereas one presented on a different operandum can compete with it (Schwartz, 1976). As the duration of a marking stimulus extends into the delay interval, integration of Eq. 2 between its endpoints predicts a positively accelerating effectiveness of the stimulus. Schaal and Branch (1990) found the predicted increase, but it was negatively accelerated for 2 of the 3 pigeons.
The association of a CS or response with the measured behavior will also depend on the modality of the CS, the modality of the response (Timberlake and Lucas, 1990), and the contingencies that make the correlation tight or weak (Killeen and Bizo, 1998). For the present argument, these correlations of response and CS with the experimenter's dependent variable are carried by the constant c.
3.0 The Effects of Delay on Choice
To apply Eq. 2 to experiments in which an animal is choosing between delayed reinforcers of different magnitudes (a) requires a scale that maps amount into reinforcing effectiveness. Perhaps the simplest "utility" function for reinforcement amount is the power function, which is the form assumed in the generalized matching law (Rachlin, 1971; Killeen, 1972; Baum, 1979). It has the advantage of simplicity, and fits most of the available data over its limited range. A disadvantage is that it has the effectiveness of reinforcement growing without bound as the amount is increased, which is implausible. Rachlin has derived other forms for utility from first principles (1992); his logarithmic, and my (1985) exponential-integral, can also accommodate data, as can Bradshaw and associates' hyperbolic discounting of amount (Bezzina et al., 2007). However, the equations look simpler if we adopt the formalism of the generalized matching law in which the reinforcing power of amount is the power function, u(a) = a^α. Then the associative strength of a response immediately followed by a stimulus change, and d later a BSE of physical magnitude a, is the product of the impact of the BSE, a^α, on the sum of the primary s′_d and secondary s_d effects. Assuming for parsimony that in the cases analyzed the relative salience of stimulus elements and responses are comparable, then c_primary ≈ c_secondary = c, and:

    s(d, a) = a^α c[e^(−d/τ) + (τ/d)(1 − e^(−d/τ))].    (4)
3.1 Methods of Adjustment
Psychophysical paradigms in which variables are adjusted to cause indifference in preferences or other judgments—"matching paradigms" (Farell and Pelli, 1999)—are more secure of interpretation than those involving a psychological scale, such as one of value (Hand, 2004; Uttal, 2000). Their units are physical measurements, and they refer to a unique psychological point, that of equivalence. This may be determined whether the underlying scale is interval, ordinal, or even nominal.
How great must an amount a1 be to balance a different amount a2 at a different delay? Set s(d1, a1) = s(d2, a2):

    a1^α[e^(−d1/τ) + (τ/d1)(1 − e^(−d1/τ))] = a2^α[e^(−d2/τ) + (τ/d2)(1 − e^(−d2/τ))],

and solve for a1:

    a1 = a2[(e^(−d2/τ) + (τ/d2)(1 − e^(−d2/τ))) / (e^(−d1/τ) + (τ/d1)(1 − e^(−d1/τ)))]^(1/α).    (5)

Equation 5 gives the relative equivalent value of amount a2 delayed d2, compared to an alternative delayed d1. Typically, d1 is "immediate"—that is, around ½ s, and then Eq. 5 gives the relative immediate equivalent amount. With d2 > d1, this ratio will be less than 1, indicating that a smaller immediate amount, relative to a2, suffices to balance the latter at a remove of d2. Note that neither amount appears in the right-hand side; no magnitude effect is predicted: As long as the ratio of delays is the same, the predictions are the same when both amounts are multiplied by a constant. In general, no magnitude effect is found in delay discounting experiments with non-human animals (Green et al., 2004; Ong and White, 2004). Figure 6 shows the course of Eq. 5, with α = 1.26 and τ = 2.12 s, passing near the average data from four pigeons in an experiment where the amount delivered after ½ s was adjusted to maintain indifference between it and a larger amount (given by the parameter in the figure) delivered at a delay.
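Equation 5 is straightforward to evaluate. Below is a minimal sketch, assuming the strengths combine primary (Eq. 1) and secondary (Eq. 2) terms as described in the text, and using the parameter values reported for Fig. 6 (α = 1.26, τ = 2.12 s); the function names are mine.

```python
import math

def strength(d, tau):
    """Bracketed delay term of Eq. 5: primary plus secondary reinforcement."""
    return math.exp(-d / tau) + (tau / d) * (1.0 - math.exp(-d / tau))

def equivalent_amount(a2, d2, d1=0.5, alpha=1.26, tau=2.12):
    """Eq. 5: amount a1 delivered at d1 that balances amount a2 delayed d2 s."""
    return a2 * (strength(d2, tau) / strength(d1, tau)) ** (1.0 / alpha)

# No magnitude effect: rescaling both amounts leaves the ratio a1/a2 unchanged.
ratio_small = equivalent_amount(2.0, 8.0) / 2.0
ratio_large = equivalent_amount(20.0, 8.0) / 20.0
```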
>> Please place Fig 6 hereabouts <<
The primary and conditioned reinforcing effects are highly correlated; Eq. 5 may be simplified by deleting the primary influence of the reinforcers on the choice responses, to yield:

    a1 = a2[((τ/d2)(1 − e^(−d2/τ))) / ((τ/d1)(1 − e^(−d1/τ)))]^(1/α),    (6)

which draws the continuous curve through the data in Fig. 6. But the primary and secondary effects may be dissociated, and when they are, alternatives with both are preferred to those with just primary reinforcement (Marcattilio and Richards, 1981; Lattal, 1984). The hyperbolic approximation to Eq. 6 provides a decent fit to these data as well, but falls noticeably farther from their average than do Eqs. 5 and 6.
In some matching experiments, the delay to one outcome is adjusted, rather than the amount. Equation 6 yields no simple prediction, but invoking the series expansion of the exponential term² that was used in going from Eq. 2 to Eq. 3, s(d_i, a_i) ≈ a_i^α c/(1 + d_i/(2τ)), leads to the simple linear relation of Eq. 7:

    d2 = (a2/a1)^α d1 + 2τ[(a2/a1)^α − 1].    (7)
Operations that increase the sensitivity to reinforcement (increase α) or flatten the gradient (increase τ) will increase the indifference point, d2. The provenance of the effect can be determined by manipulating d1, as the former will increase both slope and intercept, and the latter only the intercept. Some drugs, such as stimulants, may decrease α while increasing τ (Maguire et al., 2009; Pitts and Febbo, 2004), and their results will thus vary as a function of the balance between the two, largely determined by the value of d1. A linear equation such as (7), based on multiplicative hyperbolic functions of amount and delay, was proposed and validated by Mazur (2001), and independently by Bradshaw's group (Ho et al., 1999; Bezzina et al., 2007; da Costa Araújo et al., 2009). In Bradshaw's model, as in Eq. 7, the slope depends on relative payoffs regulated by the amount amplifier parameter α, and the intercept on a multiplicative function of that and delay sensitivity. Their model has also been applied to human delay discounting (Hinvest and Anderson, 2010; Liang et al., 2010).
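The linear indifference relation of Eq. 7 can be sketched in a few lines. This is my own derivation sketch, assuming strengths of the hyperbolic form a^α c/(1 + d/(2τ)) and setting the two alternatives equal; the names are mine.

```python
def indifference_delay(d1, a1, a2, alpha, tau):
    """Eq. 7: larger-later delay d2 at indifference, linear in d1."""
    slope = (a2 / a1) ** alpha
    intercept = 2.0 * tau * (slope - 1.0)
    return slope * d1 + intercept

# Flattening the gradient (larger tau) raises only the intercept;
# greater sensitivity to amount (larger alpha) raises slope and intercept.
base = indifference_delay(2.0, a1=1.0, a2=2.0, alpha=1.0, tau=2.0)
flat = indifference_delay(2.0, a1=1.0, a2=2.0, alpha=1.0, tau=4.0)
keen = indifference_delay(2.0, a1=1.0, a2=2.0, alpha=1.5, tau=2.0)
```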
3.2 Methods of Forced Choice
An alternative psychophysical procedure involves the measurement of the degree of preference between two fixed alternatives, or the frequency of choosing one over the other. Equation 4 may be rearranged to predict the outcome of choice experiments in which the delays and outcomes are invariant. The relative associative strength of the alternatives is:

    p1 = [1 + (a2/a1)^α (e^(−d2/τ) + (τ/d2)(1 − e^(−d2/τ))) / (e^(−d1/τ) + (τ/d1)(1 − e^(−d1/τ)))]^(−1).    (8)

In the case of unbiased choice there are two free parameters, the rate of diminishing marginal utility for larger amounts, α, and the time constant of the memory trace, τ. Note that amounts again appear as a ratio, indicating scale invariance: There is no magnitude effect. Fig. 7 shows this model follows a path similar to the data of Fox, Hand, and Reilly (2008), who asked whether rat models of ADHD (SHRs) would show steeper delay gradients than control (WKY) rats. They did. Other investigators (Adriani et al., 2003) did not find steeper gradients for SHR, but observed very large individual differences. As noted by Orduña, Hong, and Bouzas (2007), the main effect found by Fox and associates may be due to idiosyncrasies of their control rats (Sagvolden et al., 2009).
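The scale invariance of Eq. 8 is easy to verify numerically. The sketch below assumes the rearrangement of Eq. 4 described in the text; the function and variable names are mine.

```python
import math

def strength(d, tau):
    """Delay term of Eq. 4: primary plus secondary reinforcement."""
    return math.exp(-d / tau) + (tau / d) * (1.0 - math.exp(-d / tau))

def preference(a1, d1, a2, d2, alpha, tau):
    """Eq. 8: relative associative strength of (a1, d1) over (a2, d2)."""
    odds = (a2 / a1) ** alpha * strength(d2, tau) / strength(d1, tau)
    return 1.0 / (1.0 + odds)

# Amounts enter only as a ratio, so no magnitude effect is predicted:
p_small = preference(1.0, 0.5, 2.0, 6.0, alpha=1.0, tau=2.0)
p_large = preference(10.0, 0.5, 20.0, 6.0, alpha=1.0, tau=2.0)
```

Steeper gradients (smaller τ) shift preference toward the smaller-sooner alternative, which is the comparison at issue in the SHR studies.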
>> Please place Fig 7 hereabouts <<
4.0 Discussion
Prospective judgments of equivalent amounts by humans, typical in the delay-discounting literature, require computations that are different in kind from those of paradigms in which real delays are conditioned to discriminative stimuli. Humans can be instructed to contemplate the desirability of ten thousand dollars in ten years, and to stipulate how little they would settle for one week hence in lieu of it. The performance entails a scale of future time, the value of an outcome deferred by that delay, and concatenation of the non-linear time-scale with a non-linear amount scale, from which a variety of results are imaginable (Killeen, 2009; Rachlin, 2006). Little wonder that there are differences in covering models. The only way to so instruct other animals is to expose them to such realities repeatedly. The assertion in the opening of this paper that the future cannot act on non-verbal animals was meant to emphasize this difference: On the one hand, verbally presented unexperienced hypotheticals that can control human responses; on the other, the conditioning of behavior reinforced by the presentation of conditioned reinforcers signaling real, experienced delays, that controls pigeon and rat behavior.
This paper should be read as a grounding of hyperbolic models of delay discounting, not a critique of them. It presented a few ideas. First, it observed that Fig. 1 is not a model of a process. It is a summary of some other kind of process, such as the one proposed in Fig. 3. The distinction is important, as thinking of Fig. 1 as a process can be misleading. I am not alone in this concern:

In this [Fig. 1] view, … reinforcers reach back in time to effect this response in the presence of the remembered stimulus … As a model of how an animal adapts to, or learns about, situations