Running head: DISCOUNTING
Models of trace decay, eligibility for reinforcement, and delay of reinforcement
gradients, from exponential to hyperboloid
Peter R. Killeen
Arizona State University
Prepublication preprint
Killeen, P. R. (2011). Models of trace decay, eligibility for reinforcement, and delay of reinforcement gradients, from exponential to hyperboloid. Behavioural Processes, 87, 57–63. DOI: 10.1016/j.beproc.2010.12.016
Abstract
Behavior such as depression of a lever or perception of a stimulus may be strengthened by consequent behaviorally significant events (BSEs), such as reinforcers. This is the Law of Effect. As time passes since its emission, the ability of the behavior to be reinforced decreases. This is trace decay. It is upon decayed traces that subsequent BSEs operate. If the trace comes from a response, it constitutes primary reinforcement; if from perception of an extended stimulus, it is classical conditioning. This paper develops simple models of these processes. It premises exponentially decaying traces related to the richness of the environment, and conditioned reinforcement as the average of such traces over the extended stimulus, yielding an almost-hyperbolic function of duration. The models account for some data, and reinforce the theories of other analysts by providing a sufficient account of the provenance of these effects. They lead to a linear relation between sooner and later isopreference delays whose slope depends on sensitivity to reinforcement, and whose intercept depends on that and the steepness of the delay gradient. Unlike human prospective judgments, all control is vested in either primary or secondary reinforcement processes; therefore the use of the term discounting, appropriate for humans, may be less descriptive of the behavior of nonverbal organisms.
Keywords: Delay of reinforcement gradients, discounting, forced choice paradigms, magnitude
effect, matching paradigms, reinforcement learning, trace decay gradients
Pigeons cannot reliably count above 3 (Brannon et al., 2001; Nickerson, 2009; Uttal, 2008), have short time-horizons (Shettleworth and Plowright, 1989), may be stuck in time (Roberts and Feeney, 2009), do not ask for the answers to the questions they are about to be asked (Roberts et al., 2009), and fail to negotiate an amount of reinforcement commensurate with the work that they are about to undertake (Reilly et al., 2011). How do such simple creatures discount future payoffs as a function of their delay? It is the thesis of this paper that they do not: that the orderly data in such studies are the simple result of the dilution of the conditioned reinforcers which support and guide that choice, as a function of the delay to the outcome that they signal.
Classic and generally accepted concepts of causality preclude events from acting backward in time. Then what sense do we make of Fig. 1, a familiar rendition of the control exerted by delayed reinforcers? How do the animals know what's coming? Only three accounts come to mind: (a) Precognition. But causality rules that out. (b) It is memory of a past choice that makes contact with reinforcement; the figure should be reversed. Or, (c) the animals have learned what leads to what. There follows an extended argument that (b) and (c) are both true, and that in novel contexts, (b) typically leads to (c).
>> Please place Fig 1 hereabouts <<

When in the course of an animal's behavior a behaviorally significant event (BSE; or phylogenetically important event (Baum, 2005); or, more familiarly, incentive, reinforcer, or unconditioned stimulus) occurs, there immediately arises the question of whence. In computer science this is the assignment-of-credit problem. If the organism, or software, takes into account events in the last instant, there are r potential causes for the BSE, where r is a measure of the richness of context. An additional r events occurred in the prior, penultimate instant. The combination of any one of these with those in the ultimate instant could have been the causal chain that led to the BSE: r^2 sequences in toto. Extending the account further, to the antepenultimate instant, raises the pool to r^3. Continue this process back and the candidate pool of sequences grows as r^n, where n is the depth of query. If each of these instants of apprehension lasts δ s, then n = d/δ, and the candidate pool grows as r^(d/δ), where d is the delay between event and consequence. In the continuous limit, this equals e^(d/τ), where τ is the time constant of the traces—the inverse of the continuous limit of the richness parameter r. This means that the gradients get steeper in rich environments: τ = 1/r. It follows that any one causal path is eligible for e^(−d/τ) of the credit for reinforcement, everything being equal. Of course everything is not equal: The priors on some events are higher than on others, either because of their phylogenetic relevance, or their memorability, which may be enhanced by marking their occurrence with salient stimuli. Allowing for such bias, represented by the parameter c, we would expect the causal impact to decrease with time prior as:

    s′_d = c e^(−d/τ),    (1)

which is the point association¹ of an event at d seconds' remove from the BSE, as seen in Fig. 2.
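The decay posited in Eq. 1 is easy to sketch numerically. The following is a minimal illustration, not code from the paper; the function name and parameter values are my own, with tau = 1/r as in the text.

```python
import math

def point_association(d, c=1.0, tau=5.0):
    """Eq. 1: residual eligibility of an event occurring d seconds before the BSE.

    c is the bias/salience parameter; tau = 1/r is the time constant,
    the inverse of the richness of the context.
    """
    return c * math.exp(-d / tau)

# Richer environments (smaller tau) yield steeper trace-decay gradients:
lean_context = [round(point_association(d, tau=10.0), 3) for d in (0, 5, 10)]
rich_context = [round(point_association(d, tau=2.0), 3) for d in (0, 5, 10)]
```

With tau = 10 the trace at 5 s retains about 61% of its initial value; with tau = 2, about 8%.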
>> Please place Fig 2 hereabouts <<

This story for why associability between an event and a subsequent BSE may decay exponentially, retold from Johansen et al. (2009) and Killeen (2005), has some empirical support (Killeen, 2001; Escobar and Bruner, 2007). Eligibility traces play a central role in AI reinforcement learning (Singh and Sutton, 1996). Classic models such as Sutton and Barto's posit a geometrically decreasing representation of events similar to that developed here, and work to reconcile details of instrumental and Pavlovian conditioning with various instantiations of such traces (Sutton and Barto, 1990; Niv et al., 2002). Alternatively, it is possible to simply posit exponential or hyperbolic decay of memory of the stimulus, and also that these traces may or may not vary with the richness of the environment. This has been the productive tactic of most analysts of delay discounting. If this disposition is good enough for you, skip the next 5 pages.

¹ If the duration of a response is δ s, then the impact of reinforcement on it is given by the integral of Eq. 1 from d to d + δ. For brief events such as responses, this essentially equals δ times the right-hand side of Eq. 1. For responses of similar durations this coefficient is absorbed by c.
What is the purported mechanism? As developed here it is one of stimulus competition, with richer environments and greater interludes providing more opportunities for interference. A stimulus-sampling model of acquisition (Bower, 1994; Estes and Suppes, 1974; Neimark and Estes, 1967; Estes, 1950) provides the basis of a model of acquisition in the face of such contingencies degraded by delay and distraction (Killeen, 2001). It is not repeated here. Another way to think of Eq. 1 is as a measure of the signal-to-noise ratio of a delay contingency. In the case c = 1/τ, Eq. 1 describes a probability distribution, so that identification of one point from the distribution reduces candidate uncertainty by log₂(eτ) bits.
What is the relation between eligibility traces and the delay of reinforcement gradient? Fig. 3 shows 7 trace gradients for events occurring more and more remote from the BSE. The most proximate occurs at the moment of reinforcement, and is visible only as a dot in the upper right corner; it receives the full credit for which it might be eligible. An event occurring 1 time step earlier has an impact diluted by about 30% by the time of reinforcement, as inferred from where its trace cuts the ordinate, the zero-delay axis at the right of the graph. Draw this measure of eligibility, 0.7, out 1 unit from the right frame, as shown by the arrow, and connect it to the full measure in the corner by a dashed line. The event 2 steps back decays by about 50% at the time of reinforcement; draw a line from there extending to the left at 2, and continue the dashed line to it. When bored of this construction, stop to consider the shape of the delay of reinforcement gradient—the dashed line. When smoothed, it will have exactly the same shape as any of the decay traces, but will be reflected about its new origin at 0.
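The construction just described can be mimicked in a few lines. This is an illustrative sketch, not code from the paper; the names and the value of τ are mine, chosen so that one step back dilutes the trace by about 30% and two steps by about 50%, as in Fig. 3.

```python
import math

TAU = 2.8  # illustrative time constant; exp(-1/2.8) is roughly 0.7

def trace(t_since_event, tau=TAU):
    """Exponential eligibility trace of a single event (Eq. 1 with c = 1)."""
    return math.exp(-t_since_event / tau)

# The delay-of-reinforcement gradient is read off where each trace cuts the
# ordinate: the residual eligibility, at the moment of reinforcement, of
# events emitted 0, 1, 2, ... steps earlier.
gradient = [trace(d) for d in range(8)]
```

The gradient is, point for point, the same exponential as any single trace, reflected about the moment of reinforcement: the product has the shape of the process.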
>> Please place Fig 3 hereabouts <<

The distinction between these two representations, one of process and the other of product, is important. As Fig. 2 makes clear, what is present at the time of reinforcement is a decayed trace of a response. Differential reinforcer magnitude can have no retroactive effect on the shape or elevation of those traces. Reinforcers of different magnitudes do not change the decay gradients, but rather act differentially on their tails: A larger reinforcer may be more effective at leveraging the same residual memory than a small one. But those tails may be of different elevation—and thus differentially able to receive the effect of the reinforcement—because they are more or less memorable (reflected in c) or because they occur in a richer or bleaker environment (reflected in τ).
2.0 Hyperbolic Dilemmas
How can gradients be exponential when everyone says that they are hyperbolic? The curves in Fig. 1 don't cross, whereas most representations of discounted future events of differing value do. These three figures address the associability of a discrete event at a remove of t from reinforcement. They do not address situations in which that event leads to an immediate change of state signaling a deferred outcome. A signal of change of state marks the precipitating event by immediately singling it out as the precursor of a better (or possibly worse) state of affairs.
Consider a response that causes the onset of a stimulus, and after a delay of d, a BSE. Assume that each of the temporal elements of the stimulus receives associations as given by Eq. 1, and that these are otherwise equivalent in time (that is, that the parameters of Eq. 1 don't change over the delay). In the case that one element of the stimulus is highly generalizable with the next, these associations add linearly, giving a total associability equal to ∫₀^d c e^(−t/τ) dt. This integral assumes that the temporal elements dt make linearly independent contributions to the total association. Because one element of the stimulus is, by hypothesis, indiscriminable from the next, any one element—in particular the one just following a response—has an average associability

    s_d = (cτ/d)(1 − e^(−d/τ)).    (2)

Eq. 2 is not discriminable from the inverse linear relation known as hyperbolic (Killeen, 2001). Fig. 4 demonstrates this similarity by fitting both Eq. 2, and the hyperbola

    s_d = c/(1 + d/(2τ)),    (3)

to data from Richards (1981) that describe the effects of signaled delayed reinforcement on the average response rates of four pigeons. The curves through the squares superimpose. This makes sense, as Eq. 3 is a series approximation² to Eq. 2.
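The near-identity of Eq. 2 and its hyperbolic approximation can be checked numerically. The sketch below is mine, assuming Eq. 2 is the average of the exponential trace over the delay and Eq. 3 its hyperbolic approximation; parameter values are arbitrary.

```python
import math

def s_average(d, c=1.0, tau=2.0):
    """Eq. 2: mean associability of a stimulus that spans the delay d."""
    if d == 0:
        return c
    return c * (tau / d) * (1.0 - math.exp(-d / tau))

def s_hyperbola(d, c=1.0, tau=2.0):
    """Eq. 3: the hyperbolic series approximation to Eq. 2."""
    return c / (1.0 + d / (2.0 * tau))

# The two stay close over moderate delays, diverging slowly at long ones,
# which is why the fitted curves superimpose in Fig. 4.
diffs = [abs(s_average(d) - s_hyperbola(d)) for d in (0.5, 1, 2, 4, 8)]
```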
Trang 8Experienced laboratory animals can tell the difference between the start of a long delay and
the start of a short one; they are sensitive to time and delay (Moore and Fantino, 1975) The use
of Eq 2 requires that, facing start of a long delay to food and a stimulus which—in the best of
times—is contiguous with food, control by the stimulus dominates that by time Animals, in
other words, are optimists: Their behavior is primarily under the control of the most hopeful
stimuli rather than some weighted average of predictive stimuli There is good evidence that this
is often the case (Horney and Fantino, 1984; Sanabria and Killeen, 2007; Jenkins and Boakes,
1973)
Also shown in Fig. 4 is the decay trace for unsignaled reinforcement. Under the hypothesis of the prior section, it is given by Eq. 1, an exponential function, shown as the continuous curve passing near the disks, which show response rates for unsignaled (non-resetting) delays. Also shown is the hyperbola, Eq. 3, which apparently gives an inferior fit to these data—although this data-base is too limited to make secure generalizations. For unsignaled delayed reinforcement, at least in this case, the exponential gradients are, as predicted, competitive with the more traditional hyperbolic gradients. Fig. 4 illustrates Lattal's generalization that "The unsignaled delay gradient … is characterized by [generally] lower response rates and a steeper slope than the gradient obtained with otherwise equivalent signaled delays" (Lattal, 2010).
>> Please place Fig 4 hereabouts <<
The average absolute deviation between Eqs. 2 and 3 over the range from 0.99 to 0.04 is 0.064; however, letting the time constant in either equation vary from its value in the other reduces this deviation to 0.023, within experimental error.
² The exponential term may also be approximated with the more standard Maclaurin series: e^(−d/τ) ≈ 1 − d/τ + (d/τ)²/2! − ⋯.
Whereas Fig. 4 usefully compares the effects of signaled and unsignaled delays, because the unsignaled delays were non-resetting, the actually experienced delays were variable and less, by an unspecified amount, than the abscissae. A better test of the sufficiency of Eq. 1 comes from Wilkenfield et al. (1992), using resetting delays, where the abscissae provide accurate representations of the experienced delays. These investigators reported the response rates during acquisition of lever pressing from four groups of rats, nine in each group. Their data from the first 100 minutes of acquisition are shown in Fig. 5. Again, the exponential provides a plausible model.
>> Please place Fig 5 hereabouts <<

The simple hyperbolic model has been shown adequate for most discount functions for non-verbal animals (Green and Myerson, 2004; Ong and White, 2004; Green et al., 2004). But unlike its cousin the hyperbola, which is ad hoc, Eq. 2 has some theoretical motivation: It predicts radical changes in preference as a function of the nature and continuity of the stimuli that bridge the delay between response and BSE, and holds out the promise for quantifying those effects. It is consistent with the important role of conditioned reinforcers in preference for delayed outcomes (Williams and Dunn, 1991), and provides a useful refinement to a unified theory of choice (Killeen and Fantino, 1990). In the latter theory, and its precedent (Killeen, 1982), the control by a delayed reinforcer was modeled as the sum of both the primary (i.e., point association with the response; Eq. 1) and secondary (i.e., terminal link cues; essentially Eq. 2) reinforcement effects. Equation 1, and a similar logic for the association of streams of responses with reinforcement, is the heart of the model of coupling in my theory of schedule effects, MPR (Killeen, 1994).
The presence of stimuli occurring between a response and BSE may not always be beneficial to conditioning the response. Brief stimuli occurring immediately after a response (marking it) may make the response more memorable when the BSE occurs (Lieberman, McIntosh, and Thomas, 1979; Thomas et al., 1983)—perhaps by increasing the value of c. Alternatively, such stimuli may initiate adjunctive behavior that serves as an extended conditioned stimulus (CS) (Schaal and Branch, 1988). Conversely, brief stimuli occurring just before reinforcement may block control by the response-reinforcer association (Pearce and Hall, 1978). Williams (1999) and Reed and Doughty (2005) demonstrated the power of both effects in the same experiments. Whether the effects of primary and secondary reinforcement add or interfere depends on the correlation of each of the contingencies with the behavior measured by the experimenter: A CS whose presentation is not contingent on behavior will only adventitiously strengthen the target response, and, depending on temporal variables, is as likely to compete with it; furthermore, one which signals non-contingent reinforcement will compete with concurrent instrumental responses (Miczek and Grossman, 1971). A CS presented on the instrumental operandum can enhance response rate, whereas one presented on a different operandum can compete with it (Schwartz, 1976). As the duration of a marking stimulus extends into the delay interval, integration of Eq. 2 between its endpoints predicts a positively accelerating effectiveness of the stimulus. Schaal and Branch (1990) found the predicted increase, but it was negatively accelerated for 2 of the 3 pigeons.
The association of a CS or response with the measured behavior will also depend on the modality of the CS, the modality of the response (Timberlake and Lucas, 1990), and the contingencies that make the correlation tight or weak (Killeen and Bizo, 1998). For the present argument, these correlations of response and CS with the experimenter's dependent variable are carried by the constant c.
3.0 The Effects of Delay on Choice
To apply Eq. 2 to experiments in which an animal is choosing between delayed reinforcers of different magnitudes (a) requires a scale that maps amount into reinforcing effectiveness. Perhaps the simplest "utility" function for reinforcement amount is the power function, which is the form assumed in the generalized matching law (Rachlin, 1971; Killeen, 1972; Baum, 1979). It has the advantage of simplicity, and fits most of the available data over its limited range. A disadvantage is that it has the effectiveness of reinforcement growing without bound as the amount is increased, which is implausible. Rachlin has derived other forms for utility from first principles (1992); his logarithmic, and my (1985) exponential-integral, can also accommodate data, as can Bradshaw and associates' hyperbolic discounting of amount (Bezzina et al., 2007). However, the equations look simpler if we adopt the formalism of the generalized matching law in which the reinforcing power of amount is the power function, u(a) = a^α. Then the associative strength of a response immediately followed by a stimulus change, and d later a BSE of physical magnitude a, is the product of the impact of the BSE, a^α, on the sum of the primary s′_d and secondary s_d effects. Assuming for parsimony that in the cases analyzed the relative salience of stimulus elements and responses are comparable, then c_primary ≈ c_secondary = c, and:

    s(d, a) = a^α c[e^(−d/τ) + (τ/d)(1 − e^(−d/τ))].    (4)
3.1 Methods of Adjustment
Psychophysical paradigms in which variables are adjusted to cause indifference in preferences or other judgments—"matching paradigms" (Farell and Pelli, 1999)—are more secure of interpretation than those involving a psychological scale, such as one of value (Hand, 2004; Uttal, 2000). Their units are physical measurements, and they refer to a unique psychological point, that of equivalence. This may be determined whether the underlying scale is interval, ordinal, or even nominal.
How great must an amount a1 be to balance a different amount a2 at a different delay? Set s(d1, a1) = s(d2, a2):

    a1^α[e^(−d1/τ) + (τ/d1)(1 − e^(−d1/τ))] = a2^α[e^(−d2/τ) + (τ/d2)(1 − e^(−d2/τ))],

and solve for a1:

    a1 = a2[(e^(−d2/τ) + (τ/d2)(1 − e^(−d2/τ))) / (e^(−d1/τ) + (τ/d1)(1 − e^(−d1/τ)))]^(1/α).    (5)

Equation 5 gives the relative equivalent value of amount a2 delayed d2, compared to an alternative delayed d1. Typically, d1 is "immediate"—that is, around ½ s, and then Eq. 5 gives the relative immediate equivalent amount. With d2 > d1, this ratio will be less than 1, indicating that a smaller immediate amount, relative to a2, suffices to balance the latter at a remove of d2. Note that neither amount appears in the right-hand side; no magnitude effect is predicted: As long as the ratio of delays is the same, the predictions are the same when both amounts are multiplied by a constant. In general, no magnitude effect is found in delay discounting experiments with non-human animals (Green et al., 2004; Ong and White, 2004). Figure 6 shows the course of Eq. 5, with α = 1.26 and τ = 2.12 s, passing near the average data from four pigeons in an experiment where the amount delivered after ½ s was adjusted to maintain indifference between it and a larger amount (given by the parameter in the figure) delivered at a delay.
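Equation 5 is straightforward to evaluate. Below is a minimal sketch, assuming the strengths combine primary (Eq. 1) and secondary (Eq. 2) terms as described in the text, and using the parameter values reported for Fig. 6 (α = 1.26, τ = 2.12 s); the function names are mine.

```python
import math

def strength(d, tau):
    """Bracketed delay term of Eq. 5: primary plus secondary reinforcement."""
    return math.exp(-d / tau) + (tau / d) * (1.0 - math.exp(-d / tau))

def equivalent_amount(a2, d2, d1=0.5, alpha=1.26, tau=2.12):
    """Eq. 5: amount a1 delivered at d1 that balances amount a2 delayed d2 s."""
    return a2 * (strength(d2, tau) / strength(d1, tau)) ** (1.0 / alpha)

# No magnitude effect: rescaling both amounts leaves the ratio a1/a2 unchanged.
ratio_small = equivalent_amount(2.0, 8.0) / 2.0
ratio_large = equivalent_amount(20.0, 8.0) / 20.0
```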
>> Please place Fig 6 hereabouts <<
The primary and conditioned reinforcing effects are highly correlated; Eq. 5 may be simplified by deleting the primary influence of the reinforcers on the choice responses, to yield:

    a1 = a2[((τ/d2)(1 − e^(−d2/τ))) / ((τ/d1)(1 − e^(−d1/τ)))]^(1/α),    (6)

which draws the continuous curve through the data in Fig. 6. But the primary and secondary effects may be dissociated, and when they are, alternatives with both are preferred to those with just primary reinforcement (Marcattilio and Richards, 1981; Lattal, 1984). The hyperbolic approximation to Eq. 6 provides a decent fit to these data as well, but falls noticeably farther from their average than do Eqs. 5 and 6.
In some matching experiments, the delay to one outcome is adjusted, rather than the amount. Equation 6 yields no simple prediction, but invoking the series expansion of the exponential term² that was used in going from Eq. 2 to Eq. 3, s(d_i, a_i) ≈ a_i^α c/(1 + d_i/(2τ)), leads to the simple linear relation of Eq. 7:

    d2 = (a2/a1)^α d1 + 2τ[(a2/a1)^α − 1].    (7)
Operations that increase the sensitivity to reinforcement (increase α) or flatten the gradient (increase τ) will increase the indifference point, d2. The provenance of the effect can be determined by manipulating d1, as the former will increase both slope and intercept, and the latter only the intercept. Some drugs, such as stimulants, may decrease α while increasing τ (Maguire et al., 2009; Pitts and Febbo, 2004), and their results will thus vary as a function of the balance between the two, largely determined by the value of d1. A linear equation such as (7), based on multiplicative hyperbolic functions of amount and delay, was proposed and validated by Mazur (2001), and independently by Bradshaw's group (Ho et al., 1999; Bezzina et al., 2007; da Costa Araújo et al., 2009). In Bradshaw's model, as in Eq. 7, the slope depends on relative payoffs regulated by the amount amplifier parameter α, and the intercept on a multiplicative function of that and delay sensitivity. Their model has also been applied to human delay discounting (Hinvest and Anderson, 2010; Liang et al., 2010).
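The linear indifference relation of Eq. 7 can be sketched in a few lines. This is my own derivation sketch, assuming strengths of the hyperbolic form a^α c/(1 + d/(2τ)) and setting the two alternatives equal; the names are mine.

```python
def indifference_delay(d1, a1, a2, alpha, tau):
    """Eq. 7: larger-later delay d2 at indifference, linear in d1."""
    slope = (a2 / a1) ** alpha
    intercept = 2.0 * tau * (slope - 1.0)
    return slope * d1 + intercept

# Flattening the gradient (larger tau) raises only the intercept;
# greater sensitivity to amount (larger alpha) raises slope and intercept.
base = indifference_delay(2.0, a1=1.0, a2=2.0, alpha=1.0, tau=2.0)
flat = indifference_delay(2.0, a1=1.0, a2=2.0, alpha=1.0, tau=4.0)
keen = indifference_delay(2.0, a1=1.0, a2=2.0, alpha=1.5, tau=2.0)
```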
3.2 Methods of Forced Choice
An alternative psychophysical procedure involves the measurement of the degree of preference between two fixed alternatives, or the frequency of choosing one over the other. Equation 4 may be rearranged to predict the outcome of choice experiments in which the delays and outcomes are invariant. The relative associative strength of the alternatives is:

    p1 = [1 + (a2/a1)^α (e^(−d2/τ) + (τ/d2)(1 − e^(−d2/τ))) / (e^(−d1/τ) + (τ/d1)(1 − e^(−d1/τ)))]^(−1).    (8)

In the case of unbiased choice there are two free parameters, the rate of diminishing marginal utility for larger amounts, α, and the time constant of the memory trace, τ. Note that amounts again appear as a ratio, indicating scale invariance: There is no magnitude effect. Fig. 7 shows this model follows a path similar to the data of Fox, Hand, and Reilly (2008), who asked whether rat models of ADHD (SHRs) would show steeper delay gradients than control (WKY) rats. They did. Other investigators (Adriani et al., 2003) did not find steeper gradients for SHR, but observed very large individual differences. As noted by Orduña, Hong, and Bouzas (2007), the main effect found by Fox and associates may be due to idiosyncrasies of their control rats (Sagvolden et al., 2009).
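The scale invariance of Eq. 8 is easy to verify numerically. The sketch below assumes the rearrangement of Eq. 4 described in the text; the function and variable names are mine.

```python
import math

def strength(d, tau):
    """Delay term of Eq. 4: primary plus secondary reinforcement."""
    return math.exp(-d / tau) + (tau / d) * (1.0 - math.exp(-d / tau))

def preference(a1, d1, a2, d2, alpha, tau):
    """Eq. 8: relative associative strength of (a1, d1) over (a2, d2)."""
    odds = (a2 / a1) ** alpha * strength(d2, tau) / strength(d1, tau)
    return 1.0 / (1.0 + odds)

# Amounts enter only as a ratio, so no magnitude effect is predicted:
p_small = preference(1.0, 0.5, 2.0, 6.0, alpha=1.0, tau=2.0)
p_large = preference(10.0, 0.5, 20.0, 6.0, alpha=1.0, tau=2.0)
```

Steeper gradients (smaller τ) shift preference toward the smaller-sooner alternative, which is the comparison at issue in the SHR studies.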
>> Please place Fig 7 hereabouts <<
4.0 Discussion
Prospective judgments of equivalent amounts by humans, typical in the delay-discounting literature, require computations that are different in kind from those of paradigms in which real delays are conditioned to discriminative stimuli. Humans can be instructed to contemplate the desirability of ten thousand dollars in ten years, and to stipulate how little they would settle for one week hence in lieu of it. The performance entails a scale of future time, the value of an outcome deferred by that delay, and concatenation of the non-linear time-scale with a non-linear amount scale, from which a variety of results are imaginable (Killeen, 2009; Rachlin, 2006). Little wonder that there are differences in covering models. The only way to so instruct other animals is to expose them to such realities repeatedly. The assertion in the opening of this paper that the future cannot act on non-verbal animals was meant to emphasize this difference: On the one hand, verbally presented unexperienced hypotheticals that can control human responses; on the other, the conditioning of behavior reinforced by the presentation of conditioned reinforcers signaling real, experienced delays, that controls pigeon and rat behavior.
This paper should be read as a grounding of hyperbolic models of delay discounting, not a critique of them. It presented a few ideas. First, it observed that Fig. 1 is not a model of a process. It is a summary of some other kind of process, such as the one proposed in Fig. 3. The distinction is important, as thinking of Fig. 1 as a process can be misleading. I am not alone in this concern:

In this [Fig. 1] view, … reinforcers reach back in time to effect this response in the presence of the remembered stimulus … As a model of how an animal adapts to, or learns about, situations