The Oxford Handbook of Computational and Mathematical Psychology
Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide.
Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto
With offices in
Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam
Oxford is a registered trademark of Oxford University Press in the UK and certain other countries.
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016
© Oxford University Press 2015
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by license, or under terms agreed with the appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.
You must not circulate this work in any other form and you must impose this same condition on any acquirer.
Library of Congress Cataloging-in-Publication Data
Oxford handbook of computational and mathematical psychology / edited by Jerome R. Busemeyer, Zheng Wang, James T. Townsend, and Ami Eidels.
pages cm – (Oxford library of psychology)
Includes bibliographical references and index.
ISBN 978-0-19-995799-6
1. Cognition. 2. Cognitive science. 3. Psychology–Mathematical models. 4. Psychometrics. I. Busemeyer, Jerome R.
Dedicated to the memory of
Dr. William K. Estes (1919–2011) and Dr. R. Duncan Luce (1925–2012),
two of the founders of modern mathematical psychology.
SHORT CONTENTS
Oxford Library of Psychology
OXFORD LIBRARY OF PSYCHOLOGY
The Oxford Library of Psychology, a landmark series of handbooks, is published by Oxford University Press, one of the world's oldest and most highly respected publishers, with a tradition of publishing significant books in psychology. The ambitious goal of the Oxford Library of Psychology is nothing less than to span a vibrant, wide-ranging field and, in so doing, to fill a clear market need.
Encompassing a comprehensive set of handbooks, organized hierarchically, the Library incorporates volumes at different levels, each designed to meet a distinct need. At one level are a set of handbooks designed broadly to survey the major subfields of psychology; at another are numerous handbooks that cover important current focal research and scholarly areas of psychology in depth and detail. Planned as a reflection of the dynamism of psychology, the Library will grow and expand as psychology itself develops, thereby highlighting significant new research that will impact on the field. Adding to its accessibility and ease of use, the Library will be published in print and, later on, electronically.
The Library surveys psychology's principal subfields with a set of handbooks that capture the current status and future prospects of those major subdisciplines. The initial set includes handbooks of social and personality psychology, clinical psychology, counseling psychology, school psychology, educational psychology, industrial and organizational psychology, cognitive psychology, cognitive neuroscience, methods and measurements, history, neuropsychology, personality assessment, developmental psychology, and more. Each handbook undertakes to review one of psychology's major subdisciplines with breadth, comprehensiveness, and exemplary scholarship. In addition to these broadly conceived volumes, the Library also includes a large number of handbooks designed to explore in depth more specialized areas of scholarship and research, such as stress, health and coping, anxiety and related disorders, cognitive development, or child and adolescent assessment. In contrast to the broad coverage of the subfield handbooks, each of these latter volumes focuses on an especially productive, more highly focused line of scholarship and research.
Whether at the broadest or most specific level, however, all of the Library handbooks offer synthetic coverage that reviews and evaluates the relevant past and present research and anticipates research in the future. Each handbook in the Library includes introductory and concluding chapters written by its editor to provide a roadmap to the handbook's table of contents and to offer informed anticipations of significant future developments in that field.
An undertaking of this scope calls for handbook editors and chapter authors who are established scholars in the areas about which they write. Many of the nation's and world's most productive and best-respected psychologists have agreed to edit Library handbooks or write authoritative chapters in their areas of expertise. Readers will find in the Library the information they seek on the subfield or focal area of psychology in which they work or are interested.
Befitting its commitment to accessibility, each handbook includes a comprehensive index, as well as extensive references to help guide research. And because the Library was designed from its inception as an online as well as print resource, its structure and contents will be readily and rationally searchable online. Further, once the Library is released online, the handbooks will be regularly and thoroughly updated.
In summary, the Oxford Library of Psychology will grow organically to provide a thoroughly informed perspective on the field of psychology, one that reflects both psychology's dynamism and its increasing interdisciplinarity. Once published electronically, the Library is also destined to become a uniquely valuable interactive tool, with extended search and browsing capabilities. As you begin to consult this handbook, we sincerely hope you will share our enthusiasm for the more than 500-year tradition of Oxford University Press for excellence, innovation, and quality, as exemplified by the Oxford Library of Psychology.
Peter E. Nathan
Editor-in-Chief, Oxford Library of Psychology
ABOUT THE EDITORS
Jerome R. Busemeyer is Provost Professor of Psychology at Indiana University. He was the president of the Society for Mathematical Psychology and editor of the Journal of Mathematical Psychology. His theoretical contributions include decision field theory and, more recently, pioneering the new field of quantum cognition.
Zheng Wang is Associate Professor at the Ohio State University and directs the Communication and Psychophysiology Lab. Much of her research tries to understand how our cognition, decision making, and communication are contextualized.
James T. Townsend is Distinguished Rudy Professor of Psychology at Indiana University. He was the president of the Society for Mathematical Psychology and editor of the Journal of Mathematical Psychology. His theoretical contributions include systems factorial technology and general recognition theory.
Ami Eidels is Senior Lecturer at the School of Psychology, University of Newcastle, Australia, and a principal investigator in the Newcastle Cognition Lab. His research focuses on human cognition, especially visual perception and attention, combined with computational and mathematical modeling.
CONTRIBUTORS
Samuel J. Gershman, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA
Thomas L. Griffiths, Department of Psychology, University of California, Berkeley, Berkeley, CA
Todd M. Gureckis, Department of Psychology, New York University, New York, NY
Robert X. D. Hawkins, Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN
Andrew Heathcote, School of Psychology, University of Newcastle, Callaghan, NSW, Australia
Marc W. Howard, Department of Psychological and Brain Sciences, Center for Memory and Brain, Boston University, Boston, MA
Brett Jefferson, Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN
Michael N. Jones, Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN
Emmanuel Pothos, Department of Psychology, City University London, London, UK
Babette Rae, School of Psychology, University of Newcastle, Callaghan, NSW, Australia
Roger Ratcliff, Department of Psychology, The Ohio State University, Columbus, OH
Tadamasa Sawada, Department of Psychology, Higher School of Economics, Moscow, Russia
Jeffrey D. Schall, Department of Psychology, Vanderbilt Vision Research Center, Center for Integrative and Cognitive Neuroscience, Vanderbilt University, Nashville, TN
Philip Smith, School of Psychological Sciences, The University of Melbourne, Parkville, VIC, Australia
Fabian A. Soto, Department of Psychological and Brain Sciences, University of California, Santa Barbara, Santa Barbara, CA
Joshua B. Tenenbaum, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA
James T. Townsend, Department of Psychological and Brain Sciences, Cognitive Science Program, Indiana University, Bloomington, IN
Joachim Vandekerckhove, Department of Cognitive Sciences, University of California, Irvine, Irvine, CA
Wolf Vanpaemel, Faculty of Psychology and Educational Sciences, University of Leuven, Leuven, Belgium
CONTENTS
1. Review of Basic Mathematical Concepts Used in Computational and Mathematical Psychology
Jerome R. Busemeyer, Zheng Wang, Ami Eidels, and James T. Townsend
Part I • Elementary Cognitive Mechanisms
2. Multidimensional Signal Detection Theory
F. Gregory Ashby and Fabian A. Soto
3. Modeling Simple Decisions and Applications Using a Diffusion Model
Roger Ratcliff and Philip Smith
4. Features of Response Times: Identification of Cognitive Mechanisms
Daniel Algom, Ami Eidels, Robert X. D. Hawkins, Brett Jefferson, and James T. Townsend
Todd M. Gureckis and Bradley C. Love
Part II • Basic Cognitive Skills
6. Why Is Accurately Labeling Simple Magnitudes So Hard? A Past,
Chris Donkin, Babette Rae, Andrew Heathcote, and Scott D. Brown
7. An Exemplar-Based Random-Walk Model of Categorization and
Robert M. Nosofsky and Thomas J. Palmeri
Amy H. Criss and Marc W. Howard
Part III • Higher Level Cognition
Joseph L. Austerweil, Samuel J. Gershman, Joshua B. Tenenbaum, and Thomas L. Griffiths
10. Models of Decision Making under Risk and Uncertainty
Timothy J. Pleskac, Adele Diederich, and Thomas S. Wallsten
Michael N. Jones, Jon Willits, and Simon Dennis
Tadamasa Sawada, Yunfeng Li, and Zygmunt Pizlo
Part IV • New Directions
John K. Kruschke and Wolf Vanpaemel
Joachim Vandekerckhove, Dora Matzke, and Eric-Jan Wagenmakers
Thomas J. Palmeri, Jeffrey D. Schall, and Gordon D. Logan
16. Mathematical and Computational Modeling in Clinical Psychology
Richard W. J. Neufeld
Jerome R. Busemeyer, Zheng Wang, and Emmanuel Pothos
PREFACE
Computational and mathematical psychology has enjoyed rapid growth over the past decade. Our vision for the Oxford Handbook of Computational and Mathematical Psychology is to invite and organize a set of chapters that review these most important developments, especially those that have impacted—and will continue to impact—other fields such as cognitive psychology, developmental psychology, clinical psychology, and neuroscience. Together with a group of dedicated authors, who are leading scientists in their areas, we believe we have realized our vision. Specifically, the chapters cover the key developments in elementary cognitive mechanisms (e.g., signal detection, information processing, reinforcement learning), basic cognitive skills (e.g., perceptual judgment, categorization, episodic memory), higher-level cognition (e.g., Bayesian cognition, decision making, semantic memory, shape perception), modeling tools (e.g., Bayesian estimation and other new model comparison methods), and emerging new directions (e.g., neurocognitive modeling, applications to clinical psychology, quantum cognition) in computational and mathematical psychology.
An important feature of this handbook is that it aims to engage readers with various levels of modeling experience. Each chapter is self-contained and written by authoritative figures in the topic area. Each chapter is designed to be a relatively applied introduction with a great emphasis on empirical examples (see the New Handbook of Mathematical Psychology (2014) by Batchelder, Colonius, Dzhafarov, and Myung for a more mathematically foundational and less applied presentation). Each chapter endeavors to immediately involve readers, inspire them to apply the introduced models to their own research interests, and refer them to more rigorous mathematical treatments when needed. First, each chapter provides an elementary overview of the basic concepts, techniques, and models in the topic area. Some chapters also offer a historical perspective of their area or approach. Second, each chapter emphasizes empirical applications of the models. Each chapter shows how the models are being used to understand human cognition and illustrates the use of the models in a tutorial manner. Third, each chapter strives to create engaging, precise, and lucid writing that inspires the use of the models.
The chapters were written for a typical graduate student in virtually any area of psychology, cognitive science, and related social and behavioral sciences, such as consumer behavior and communication. We also expect it to be useful for readers ranging from advanced undergraduate students to experienced faculty members and researchers. Beyond being a handy reference book, it should be beneficial as
a textbook for self-teaching, and for graduate-level (or advanced undergraduate-level) courses in computational and mathematical psychology.
We would like to thank all the authors for their excellent contributions. Also, we thank the following scholars who helped review the book chapters in addition to the editors (listed alphabetically): Woo-Young Ahn, Greg Ashby, Scott Brown, Cody Cooper, Amy Criss, Adele Diederich, Chris Donkin, Yehiam Eldad, Pegah Fakhari, Birte Forstmann, Tom Griffiths, Andrew Heathcote, Alex Hedstrom, Joseph Houpt, Marc Howard, Matt Irwin, Mike Jones, John Kruschke, Peter Kvam, Bradley Love, Dora Matzke, Jay Myung, Robert Nosofsky, Tim Pleskac, Emmanuel Pothos, Noah Silbert, Tyler Solloway, Fabian Soto, Jennifer Trueblood, Joachim Vandekerckhove, Wolf Vanpaemel, Eric-Jan Wagenmakers, and Paul Williams. The authors' and reviewers' efforts ensure our confidence in the high quality of this handbook.
Finally, we would like to express how much we appreciate the outstanding assistance and guidance provided by our editorial team and production team at Oxford University Press. The hard work provided by Joan Bossert, Louis Gulino, Anne Dellinger, A. Joseph Lurdu Antoine, the production team of Newgen Knowledge Works Pvt. Ltd., and others at Oxford University Press has been essential for the development of this handbook. It has been a true pleasure working with this team!
Jerome R. Busemeyer
Zheng Wang
James T. Townsend
Ami Eidels
December 16, 2014
1 Review of Basic Mathematical Concepts Used in Computational and Mathematical Psychology
Jerome R. Busemeyer, Zheng Wang, Ami Eidels, and James T. Townsend
Key Words: expectations, maximum likelihood estimation
We have three ways to build theories to explain and predict how variables interact and relate to each other in psychological phenomena: using natural verbal languages, using formal mathematics, and using computational methods. Human intuitive and verbal reasoning has a lot of limitations. For example, Hintzman (1991) summarized at least 10 critical limitations, including our incapability to imagine how a dynamic system works. Formal models, including both mathematical and computational models, can address these limitations of human reasoning.
Mathematics is a "radically empirical" science (Suppes, 1984, p. 78), with consistent and rigorous evidence (the proof) that is "presented with a completeness not characteristic of any other area of science" (p. 78). Mathematical models can help avoid logic and reasoning errors that are typically encountered in human verbal reasoning. The complexity of theorizing and data often requires the aid of computers and computational languages. Computational models and mathematical models can be thought of as a continuum of a theorizing process. Every computational model is based on a certain mathematical model, and almost every mathematical model can be implemented as a computational model.
Psychological theories may start as a verbal description, which then can be formalized using mathematical language and subsequently coded into computational language. By testing the models using empirical data, the model-fitting outcomes can provide feedback to improve the models, as well as our initial understanding and verbal descriptions. For readers who are newcomers to this exciting field, this chapter provides a review of basic concepts of mathematics, probability, and statistics used in computational and mathematical modeling of psychological representation, mechanisms, and processes. See Busemeyer and Diederich (2010) and Lewandowsky and Farrell (2010) for more detailed presentations.
Mathematical Functions
Mathematical functions are used to map a set of points called the domain of the function into a set of points called the range of the function, such that only one point in the range is assigned to each point in the domain.1 As a simple example, the linear function is defined as f(x) = a·x, where the constant a is the slope of a straight line. In general, we use the notation f(x) to represent a function f that maps a domain point x into a range point y = f(x). If a function f(x) has the property that each range point y can only be reached by a single unique domain point x, then we can define the inverse function f^(-1)(y) = x that maps each range point y = f(x) back to the corresponding domain point x. For example, the quadratic function is defined as the map f(x) = x^2 = x·x, and if we pick the number x = 3.5, then f(3.5) = 3.5^2 = 12.25. The quadratic function is defined on a domain of both positive and negative real numbers, and it does not have an inverse because, for example, (-x)^2 = x^2, and so there are two ways to get back from each range point y to the domain. However, if we restrict the domain to the non-negative real numbers, then the inverse of x^2 exists and it is the square root function defined on non-negative real numbers, √y = √(x^2) = x. There are, of course, a large number of functions used in mathematical psychology, but some of the most popular ones include the following.
The power function is denoted x^a, where the variable x is a positive real number and the constant a is called the power. A quadratic function can be obtained by setting a = 2, but we could instead choose a = 0.50, which is the square root function; choosing a = -1 produces the reciprocal x^(-1) = 1/x; or we could choose any real number such as a = 1.37. Using a calculator, one finds that if x = 15.25 and a = 1.37, then 15.25^1.37 = 41.8658. One important property to remember about power functions is that x^a · x^b = x^(a+b), x^b · y^b = (x·y)^b, and (x^a)^b = x^(a·b). Also note that x^0 = 1. Note that when working with the power function, the variable x appears in the base, and the constant a appears as the power.
The exponential function is denoted e^x, where the exponent x is any real-valued variable and the constant base e stands for a special number that is approximately e ≅ 2.7183. Sometimes it is more convenient to use the notation e^x = exp(x) instead. Using a calculator, we can calculate e^2.5 = 2.7183^2.5 = 12.1825. Note that the exponent can be negative, -x < 0, in which case we can write e^(-x) = 1/e^x. If x = 0, then e^0 = 1. The exponential function always returns a positive value, e^x > 0, and it approaches zero as x approaches negative infinity. More complex forms of the exponential are often used. For example, you will later see the function e^(-((x - μ)/σ)^2), where x is a variable and μ and σ are constants. In this case, it is more convenient to first compute the squared deviation y = ((x - μ)/σ)^2 and then compute exp(-y). One important property to remember about exponential functions is that e^x · e^y = e^(x+y) and (e^x)^a = e^(a·x). In contrast to the power function, the base of the exponential is a constant and the exponent is a variable.
The (natural) log function is denoted ln(x) for positive values of x. For example, using a calculator, one finds that ln(12.1825) = 2.5. (Note that we normally use the natural base e = 2.7183. If instead we used base 10, then log10(10) = 1.) The log function obeys the rules ln(x·y) = ln(x) + ln(y) and ln(x^a) = a·ln(x). The log function is the inverse of the exponential function, ln(exp(x)) = x, and the exponential function is the inverse of the log function, exp(ln(x)) = x. The function a^x, where a is a constant and x is a variable, can be rewritten in terms of the exponential function: define b = ln(a); then e^(b·x) = (e^b)^x = exp(ln(a))^x = a^x.
Figure 1.1 illustrates the power, exponential, and log functions using different coefficient values for the function. As can be seen, the coefficient changes the curve of the functions.
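These identities are easy to verify numerically. The following is a minimal Python sketch, using only the standard library; the values chosen for x, y, a, and b are arbitrary. It reproduces the worked examples above and checks the power, exponential, and log rules.

```python
import math

# Worked examples from the text
print(15.25 ** 1.37)          # power function: approximately 41.8658
print(math.exp(2.5))          # exponential function: approximately 12.1825
print(math.log(12.1825))      # the natural log inverts exp: approximately 2.5

# Power-function rules: x^a * x^b = x^(a+b), x^b * y^b = (x*y)^b, (x^a)^b = x^(a*b)
x, y, a, b = 2.0, 3.0, 1.37, 0.5
assert math.isclose(x**a * x**b, x**(a + b))
assert math.isclose(x**b * y**b, (x * y)**b)
assert math.isclose((x**a)**b, x**(a * b))

# Log rules: ln(x*y) = ln(x) + ln(y), ln(x^a) = a*ln(x), and exp/ln are inverses
assert math.isclose(math.log(x * y), math.log(x) + math.log(y))
assert math.isclose(math.log(x**a), a * math.log(x))
assert math.isclose(math.log(math.exp(x)), x)

# Rewriting a^x in terms of the exponential: a^x = exp(x * ln(a))
assert math.isclose(math.exp(x * math.log(a)), a**x)
```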
Last but not least are the trigonometric functions based on a circle. Figure 1.2 shows a circle with its center located at coordinates (0, 0) in an (X, Y) plane. Now imagine a line segment of radius r = 1 that extends from the center point to the circumference of the circle. This line segment intersects with the circumference at coordinates (cos(t·π), sin(t·π)) in the plane. The coordinate cos(t·π) is the projection of the point on the circumference down onto the X axis, and the point sin(t·π) is the projection of the point on the circumference onto the Y axis. The variable t (which, for example, can be time) moves this point around the circle, with positive values moving the point counterclockwise and negative values moving it clockwise. An angle of π equals one-half cycle around the circle, and 2π is the period of time it takes to go all the way around once. The two functions are related by a translation (called the phase) in time: cos(t·π - π/2) = sin(t·π). Note that cos is an even function because cos(t·π) = cos(-t·π), whereas sin is an odd function because -sin(t·π) = sin(-t·π). Also note that these functions are periodic in the sense that, for example, cos(t·π) = cos(t·π + 2·k·π) for any integer k. We can generalize these two functions by changing the frequency and the phase. For example, cos(ω·t·π + θ) is a cosine function with a frequency ω (changing the time it takes to complete a cycle) and a phase θ (advancing or delaying the initial value at time t = 0).
Fig. 1.1 Examples of three important functions, with various parameter values. From left to right: power function, exponential function, and log function. See text for details.
Fig. 1.2 Left panel illustrates a point on a unit circle with a radius equal to one. Vertical line shows sine, horizontal line shows cosine. Right panel shows sine as a function of time. The point on the Y axis of the right panel corresponds to the point on the Y axis of the left panel.
Derivatives and Integrals
A derivative of a continuous function is the rate of change or the slope of the function at some point. Suppose f(x) is some continuous function. For a small increment Δ, the change in this function is f(x) - f(x - Δ), and the rate of change is the change divided by the increment, df(x) = [f(x) - f(x - Δ)]/Δ. As the increment Δ approaches zero, this ratio converges to the derivative of the function at x, denoted (d/dx) f(x). Rules for computing the derivatives of many functions have been derived in calculus (see Stewart (2012) or any calculus textbook for an introduction to calculus). For example, in calculus it is shown that (d/dx) e^(c·x) = c·e^(c·x), which says that the slope of the exponential function at any point x is proportional to the exponential function itself. As another example, it is shown in calculus that (d/dx) x^a = a·x^(a-1), which is the derivative of the power function. For example, the derivative of a quadratic function a·x^2 is a linear function 2·a·x, and the derivative of the linear function a·x is the constant a. The derivative of the cosine function is (d/dt) cos(t) = -sin(t), and the derivative of sine is (d/dt) sin(t) = cos(t).
Fig. 1.3 Illustration of the power function and its derivatives. The lines in both panels mark the power-function line. The slope of the dotted line (the tangent to the function) is given by the derivative of that function (in this example, at x = 1).
Figure 1.3 illustrates the derivative of the power function at the value x = 1 for two different coefficients. The curved line shows the power function, and the straight line touches the curve at x = 1. The slope of this line is the derivative.
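The derivative rules above can also be checked numerically with finite differences. The short Python sketch below is a minimal illustration (the step size h and the values of c, a, x, and t are arbitrary choices); it compares central-difference approximations with the analytic derivatives of the exponential, power, and trigonometric functions.

```python
import math

def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference approximation to df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

c, a, x, t = 0.8, 1.37, 1.0, 0.3

# d/dx e^(c*x) = c * e^(c*x)
print(numerical_derivative(lambda v: math.exp(c * v), x), c * math.exp(c * x))

# d/dx x^a = a * x^(a-1)
print(numerical_derivative(lambda v: v ** a, x), a * x ** (a - 1))

# d/dt cos(t) = -sin(t) and d/dt sin(t) = cos(t)
print(numerical_derivative(math.cos, t), -math.sin(t))
print(numerical_derivative(math.sin, t), math.cos(t))
```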
The integral of a continuous function is the area under the curve within some interval (see Fig. 1.4). Suppose f(x) is a continuous function of x within the interval [a, b]. A simple way to approximate this area is to divide the interval into N very small steps, with a small increment Δ = (b - a)/N being a step; form a rectangle of width Δ and height f(x) at each step; and finally sum all the areas of the rectangles to obtain an approximate area under the curve. As the number of intervals becomes arbitrarily large and the increments get arbitrarily small, this sum converges to the integral, denoted ∫_a^b f(x) dx. If we allow the upper limit of the integral to be a variable, say z, then the integral becomes a function of the upper limit, which can be written as F(z) = ∫_a^z f(x) dx. What is the derivative of an integral? Let's examine the change in the area divided by the increment: for a small increment Δ, the added area is approximately f(z)·Δ, so [F(z) - F(z - Δ)]/Δ ≈ f(z), and in the limit (d/dz) F(z) = f(z). This is the fundamental theorem of calculus, and the theorem can then be used to find the integral of a function. For example, the integral ∫_0^z x^a dx = (a + 1)^(-1)·z^(a+1) because (d/dz)[(a + 1)^(-1)·z^(a+1)] = z^a.
Differential equations are built from derivatives and are often used to describe how a quantity changes continuously over time. For example, suppose V(t) represents the strength of a neural
connection between an input and an output at time t, and suppose x(t) represents the reward delivered at time t that is guiding the learning process. A simple, discrete-time linear model of learning can be V(t) = (1 - α) · V(t - 1) + α · x(t), where α is the learning rate parameter. We can rewrite this as a difference equation:
V(t) - V(t - 1) = -α · V(t - 1) + α · x(t)
= -α · (V(t - 1) - x(t)).
This model states that the change in strength at time t is proportional to the negative of the error signal, which is defined as the difference between the previous strength and the new reward.
Fig. 1.4 The integral of the function is the area under the curve. It can be approximated as the sum of the areas of the rectangles (left panel). As the rectangles become narrower (middle), the sum of their areas converges to the true integral (right).
If we wish
to describe learning as occurring more continuously in time, we can introduce a small time increment Δt into the model so that it states
V(t) - V(t - Δt) = -α · Δt · (V(t - Δt) - x(t)),
which says that the change in strength is proportional to the negative of the error signal, with the constant of proportionality now modified by the time increment. Dividing both sides by the time increment Δt we obtain
[V(t) - V(t - Δt)] / Δt = -α · (V(t - Δt) - x(t)),
and now if we allow the time increment to approach zero in the limit, Δt → 0, then the preceding equation converges to a limit that is the differential equation
(d/dt) V(t) = -α · (V(t) - x(t)),
which states that the rate of change in strength is proportional to the negative of the error signal. Sometimes we can solve the differential equation for a simple solution. For example, the solution to the equation (d/dt) V(t) = -α · V(t) + c is V(t) = c/α - e^(-α·t), because when we substitute this solution back into the differential equation, it satisfies the equality of the differential equation.
A stochastic difference equation is frequently used in cognitive modeling to represent how a state changes across time when it is perturbed by noise. For example, if we assume that the strength of a connection changes according to the preceding learning model, but with some noise (denoted ε(t)) added on each step, then we obtain the stochastic difference equation
V(t) - V(t - Δt) = -α · Δt · (V(t - Δt) - x(t)) + √Δt · ε(t).
Note that the noise is multiplied by √Δt instead of Δt. This is required so that the effect of the noise does not vanish or blow up as the time increment shrinks; instead, the variance of the accumulated noise remains proportional to Δt (which is the key characteristic of Brownian motion processes). See Bhattacharya and Waymire (2009) for an excellent book on stochastic processes.
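To make these dynamics concrete, the following minimal Python sketch simulates both the deterministic learning rule and its stochastic counterpart with small time steps; the parameter values, the constant reward x, and the noise scale are arbitrary illustrative choices.

```python
import random

alpha, dt, sigma = 0.2, 0.01, 0.5   # learning rate, time step, noise scale (arbitrary)
n_steps = int(10 / dt)              # simulate 10 time units
x = 1.0                             # constant reward signal, for illustration

V_det, V_noisy = 0.0, 0.0
for _ in range(n_steps):
    # deterministic update: dV = -alpha * (V - x) * dt
    V_det += -alpha * (V_det - x) * dt
    # stochastic update: noise enters scaled by sqrt(dt), as in Brownian motion
    V_noisy += -alpha * (V_noisy - x) * dt + sigma * (dt ** 0.5) * random.gauss(0, 1)

print(V_det)    # approaches x = 1 (the deterministic solution decays exponentially)
print(V_noisy)  # fluctuates around the deterministic trajectory
```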
Elementary Probability Theory
Probability theory describes how to assign probabilities to events. See Feller (1968) for a review of probability theory. We start with a sample space, a set denoted Ω, which contains all the unique outcomes that can be realized. For simplicity, we will assume (unless noted otherwise) that the sample space is finite. (There could be a very large number of outcomes, but the number is finite.) For example, if a person takes two medical tests, test A and test B, and each test can be positive or negative, then the sample space contains four mutually exclusive and exhaustive outcomes: all four combinations of positive and negative test results from tests A and B. Figure 1.5 illustrates the situation for this simple example.
Fig. 1.5 Two ways to illustrate the probability space of events A and B. The contingency table (left) and the Venn diagram (right) correspond in the following way: positive values on both tests in the table (the conjunctive event, A ∩ B) are represented by the overlap of the circles in the Venn diagram. Positive values on one test but not on the other in the table (the XOR event, A positive and B negative, or vice versa) are represented by the nonoverlapping areas of circles A and B. Finally, tests that are both negative (upper left entry in the table) correspond in the Venn diagram to the area within the rectangle (the so-called "sample space") that is not occupied by any of the circles.
An event, such as the event A (e.g., test A is positive), is a subset of the sample space. Suppose for the moment that A and B are two events. The disjunctive event A or B (e.g., test A is positive or test B is positive) is represented as the union A ∪ B. The conjunctive event A and B (e.g., test A is positive and test B is positive) is represented as the intersection A ∩ B. The impossible event (e.g., test A is neither positive nor negative), denoted ∅, is an empty set. The certain event is the entire sample space Ω. The complement of an event A (all outcomes not in A) is denoted Ā.
A probability function p assigns a number between zero and one to each event. The impossible event is assigned zero, and the certain event is assigned one. The other events are assigned probabilities 0 ≤ p(A) ≤ 1, with p(Ā) = 1 - p(A). However, these probabilities must obey the following additive rule: if A ∩ B = ∅, then p(A ∪ B) = p(A) + p(B). What happens if the events are not mutually exclusive, so that A ∩ B ≠ ∅? The answer is called the "or" rule, which follows from the previous additive rule: p(A ∪ B) = p(A) + p(B) - p(A ∩ B).
Suppose we learn that some event A has occurred, and now we wish to define the new probability for event B conditioned on this known event. The conditional probability p(B|A) stands for the probability of event B given that event A has occurred, which is defined as p(B|A) = p(A ∩ B)/p(A). Similarly, p(A|B) = p(A ∩ B)/p(B) is the probability of event A given that B has occurred. Using the definition of conditional probability, we can then define the "and" rule for joint probabilities as follows: the probability of A and B equals p(A ∩ B) = p(A) · p(B|A) = p(B) · p(A|B).
Bayes rule It describes how to revise one’s beliefs
based on evidence Suppose we have two mutually
exclusive and exhaustive hypotheses denoted H1
and H2 For example H1could be a certain disease is
present and H2is the disease is not present Define
the event D as some observed data that provide
evidence for or against each hypothesis, such as a
medical test result Suppose p(D|H1) and p (D|H2)
are known These are called the likelihood’s of thedata for each hypothesis For example, medicaltesting would be used to determine the likelihood
of a positive versus negative test result when thedisease is known to be present, and the likelihood
of a positive versus negative test would also beknown when the disease is not present We define
p(H1) and p(H2) as the prior probabilities ofeach hypothesis For example, these priors may
be based on base rates for disease present ornot Then according to the conditional probabilitydefinition
p(H1|D) = p(H1)p(D|H1)
The last line is Bayes’ rule The probability p (H1|D)
is called the posterior probability of the hypothesisgiven the data It reflects the revision from theprior produced by the evidence from the data
If there are M ≥ 2 hypotheses, then the rule isextended to be
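As a concrete illustration of Bayes' rule, the short Python sketch below works through the two-hypothesis medical-test example; the base rate and test likelihoods are made-up numbers chosen only for illustration.

```python
# Hypothetical numbers for illustration only
prior_disease = 0.02            # p(H1): base rate of the disease
prior_healthy = 1 - prior_disease
p_pos_given_disease = 0.95      # likelihood p(D | H1), D = positive test
p_pos_given_healthy = 0.10      # likelihood p(D | H2)

# Bayes' rule: posterior = prior * likelihood / total probability of the data
evidence = (prior_disease * p_pos_given_disease
            + prior_healthy * p_pos_given_healthy)
posterior_disease = prior_disease * p_pos_given_disease / evidence

print(posterior_disease)  # about 0.16: a positive test raises the prior 0.02 to roughly 0.16
```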
We often work with events that are assigned to numbers. A random variable is a function that assigns real numbers to events. For example, a person may look at an advertisement and then rate how effective it is on a nine-point scale. In this case, there are nine mutually exclusive and exhaustive categories to choose from on the rating
scale, and each choice is assigned a number (say, 1, 2, . . . , or 9). Then we can define a random variable X(R), which is a function that maps the category event R onto one of the nine numbers. For example, if the person chooses the middle rating option, so that R = middle, then we assign X(middle) = 5. Usually we suppress the reference to the event and instead write the random variable simply as X. For example, we can ask what is the probability that the random variable is assigned the value 5. We assign a probability to each value of a random variable by assigning it the probability of the event that produces the value. For example, p(X = 5) equals the probability of the event that the person picks the middle value. Suppose the random variable has N values x1, x2, . . . , xi, . . . , xN. In our previous example with the rating scale, the random variable had nine values. The function p(X = xi) (interpreted as the probability that the person picks a choice corresponding to value xi) is called the probability mass function for the random variable X. This function has the following property: each probability is nonnegative and the probabilities sum to one, Σi p(X = xi) = 1.
Often we measure more than one random variable. For example, we could present an advertisement and ask how effective it is for the participant personally but also ask how effective the participant believes it is for others. Suppose X is the random variable for the nine-point rating scale for self, and let Y be the random variable for the nine-point rating scale for others. Then we can define a joint probability that xi is selected for self and that yj is selected for others. These joint probabilities form a two-way 9 × 9 table with p(X = xi, Y = yj) in each cell. This joint probability function has the same properties: each entry is nonnegative and the entries sum to one.
Often, we work with random variables that have a continuous rather than a discrete and finite distribution, such as the normal distribution. Suppose X is a univariate continuous random variable. In this case, the probability assigned to each real number is zero (there are uncountably infinitely many of them in any interval). Instead, we start by defining the cumulative distribution function F(x) = p(X ≤ x). Then we define the probability density at each value of x as the derivative of the cumulative probability function, f(x) = (d/dx) F(x). Given the density, we compute the probability of X falling in some interval [a, b] as p(X ∈ [a, b]) = ∫_a^b f(x) dx. The increment f(x)·dx for the continuous random variable is conceptually related to the mass function of the discrete case. The most important example is the normal distribution, whose density is f(x) = (1/(σ·√(2π))) · e^(-(1/2)·((x - μ)/σ)^2), where μ is the mean of the distribution and σ is the standard deviation of the distribution. The normal distribution is popular because of the central limit theorem, which states that the sample mean X̄ = (1/n)·Σ Xi of a random variable X will approach normal as the number of samples becomes arbitrarily large, even if the original random variable X is not normal. We often work with sample means, which we expect to be approximately normal because of the central limit theorem.
Expectations
When working with random variables, we are often interested in their moments (i.e., their means, variances, or correlations). See Hogg and Craig (1970) for a reference on mathematical statistics. These different moments are different concepts, but they are all defined as expected values of the random variables. The expectation of a random variable is defined as E[X] = Σi p(X = xi) · xi for the discrete case, and it is defined as E[X] = ∫ f(x) · x · dx for the continuous case. The mean of a random variable X, denoted μX, is defined as the expectation of the random variable, μX = E[X]. The variance of a random variable X, denoted σ²X, is defined as the expectation of the squared deviation around the mean: σ²X = Var(X) = E[(X - μX)²]. For example, in the discrete case, this equals σ²X = Σi p(X = xi) · (xi - μX)². The standard deviation is the square root of the variance, σX = √(σ²X). The covariance between two random variables (X, Y), denoted σXY, is defined by the expectation of the product of deviations around the two means: σXY = Cov(X, Y) = E[(X - μX) · (Y - μY)].
Often we need to combine two random variables by a linear combination Z = a · X + b · Y, where a, b are two constants. For example, we may sum two scores, a = 1 and b = 1, or take a difference between two scores, a = 1, b = -1. There are two important rules for determining the mean and the variance of a linear transformation. The expectation operation is linear: E[a · X + b · Y] = a · E[X] + b · E[Y]. The variance operator, however, is not linear: var(a · X + b · Y) = a²·var(X) + b²·var(Y) + 2·a·b·cov(X, Y).
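These two rules can be checked by simulation. The following Python sketch (the distributions and the constants a and b are arbitrary choices) simulates two correlated random variables and compares the sample mean and variance of a·X + b·Y with the values predicted by the rules.

```python
import random
import statistics

random.seed(1)
n, a, b = 100_000, 1.0, -1.0

# Build correlated variables: Y shares a common component with X
common = [random.gauss(0, 1) for _ in range(n)]
X = [2 + c + random.gauss(0, 0.5) for c in common]
Y = [5 + 0.8 * c + random.gauss(0, 0.5) for c in common]
Z = [a * x + b * y for x, y in zip(X, Y)]

mx, my = statistics.fmean(X), statistics.fmean(Y)
cov_xy = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / (n - 1)

mean_rule = a * mx + b * my
var_rule = a**2 * statistics.variance(X) + b**2 * statistics.variance(Y) + 2 * a * b * cov_xy

print(statistics.fmean(Z), mean_rule)    # the two means agree
print(statistics.variance(Z), var_rule)  # the two variances agree (approximately)
```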
Maximum Likelihood Estimation
Computational and mathematical models of psychology contain parameters that need to be estimated from the data. For example, suppose a person can choose to play or not play a slot machine at the beginning of each trial. The slot machine pays the amount x(t) on trial t, but this amount is not revealed until the trial is over. Consider a simple model that assumes that the probability of choosing to gamble on trial t, denoted p(t), is predicted by the following learning model:
p(t) = 1 / (1 + e^(-β·V(t))),
where V(t) is the strength learned from the payoffs on the preceding trials according to the linear learning rule described earlier. This model has two parameters, α and β, that must be estimated from the data. This is analogous to the estimation problem one faces when using multiple linear regression, where the linear regression is the model and the regression coefficients are the model parameters. However, computational and mathematical models, such as the earlier learning model example, are nonlinear with respect to the model parameters, which makes them more complicated, and one cannot use simple linear regression fitting routines.
The model parameters are estimated from the empirical experimental data. These experiments usually consist of a sample of participants, and each participant provides a series of responses to several experimental conditions. For example, a study of learning to gamble could obtain 100 choice trials at each of 5 payoff conditions from each of 50 participants. One of the first issues for modeling is the level of analysis of the data. On the one hand, a group-level analysis would fit a model to all the data from all the participants, ignoring individual differences. This is not a good idea if there are substantial individual differences. On the other hand, an individual-level analysis would fit a model to each individual separately, allowing arbitrary individual differences. This introduces a new set of parameters for each person, which is unparsimonious. A hierarchical model applies the model to all of the individuals, but it includes a model for the distribution of individual differences. This is a good compromise, but it requires a good model of the distribution of individual differences. Chapter 13 of this book describes the hierarchical approach. Here we describe the basic ideas of fitting models at the individual level using a method called maximum likelihood (see Myung, 2003, for a detailed tutorial on maximum likelihood estimation). Also see Hogg and Craig (1970) for the general properties of maximum likelihood estimates.
Suppose we obtain 100 choice trials (gamble, not gamble) from 5 payoff conditions from each participant. The above learning model has two parameters (α, β) that we wish to estimate using these data. Denote the observed sequence of 500 choices as D = [x1, x2, . . . , xt, . . . , x500], where each xt is zero (not gamble) or one (gamble). If we pick values for the two parameters (α, β), then we can insert these into our learning model and compute the probability of gambling, p(t), for each trial from the model. Define p(xt, t) as the probability that the model predicts the value xt observed on trial t. For example, if xt = 1, then p(xt, t) = p(t), but if xt = 0, then p(xt, t) = 1 - p(t), where recall that p(t) is the predicted probability of choosing the gamble. Then we compute the likelihood of the observed sequence of data D given the model parameters as the product of the trial-by-trial probabilities, L(D|α, β) = Πt p(xt, t).
Fig. 1.6 Example of maximum likelihood estimation. The histograms describe data sampled from a Gamma distribution with scale and shape parameters both equal to 20. Using maximum likelihood we can estimate the parameter values of a Gamma distribution that best fits the sample (they turn out to be 20.4 and 19.5, respectively), and plot its probability density function (solid line).
To make this computationally feasible, we use the log likelihood instead:
LnL(D|α, β) = Σt ln p(xt, t).
This likelihood changes depending on our choice of the parameter values (α, β), and the maximum likelihood estimates are the parameter values that make the observed data most likely. Optimization routines in computational software, such as MATLAB, R, Gauss, and Mathematica, can be used to find the maximum likelihood estimates. The log likelihood is a goodness-of-fit measure: higher values indicate better fit. Actually, the computer algorithms find the minimum of the badness-of-fit measure
G² = -2 · LnL(D|α, β).
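To illustrate how such a fit can be programmed, here is a compact Python sketch of maximum likelihood estimation for the two-parameter learning model. It is only a sketch: it assumes SciPy and NumPy are available, generates made-up payoff and choice data, and adopts one particular way of updating V(t) from the revealed payoffs.

```python
import math
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: payoffs x[t] and binary choices (1 = gamble, 0 = not gamble)
rng = np.random.default_rng(0)
payoffs = rng.normal(0.5, 1.0, size=500)
choices = rng.integers(0, 2, size=500)        # placeholder data, for illustration only

def neg_log_likelihood(params, payoffs, choices):
    alpha, beta = params
    V, lnL = 0.0, 0.0
    for t in range(len(choices)):
        p_gamble = 1.0 / (1.0 + math.exp(-beta * V))       # logistic choice rule
        p_obs = p_gamble if choices[t] == 1 else 1.0 - p_gamble
        lnL += math.log(max(p_obs, 1e-10))                  # guard against log(0)
        V = (1 - alpha) * V + alpha * payoffs[t]            # linear learning update after the payoff is revealed
    return -2.0 * lnL                                       # G^2 badness-of-fit measure

result = minimize(neg_log_likelihood, x0=[0.3, 1.0], args=(payoffs, choices),
                  bounds=[(0.001, 1.0), (0.001, 20.0)], method="L-BFGS-B")
print(result.x, result.fun)   # estimated (alpha, beta) and the minimized G^2
```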
Maximum likelihood is not restricted to learning models, and it can be used to fit all kinds of models. For example, if we observe a response time on each trial, and our model predicts the response time for each trial, then the preceding equation can be applied with xt equal to the observed response time on a trial, and with p(xt, t) equal to the predicted probability for the observed value of response time on that trial. Figure 1.6 shows an example in which a sample of response-time data (summarized in the figure by a histogram) was fit by a gamma distribution model for response time using two model parameters.
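In the same spirit as Figure 1.6, the following sketch fits a gamma distribution to simulated response times by maximum likelihood; it assumes SciPy and NumPy, and the simulated data stand in for the sample shown in the figure.

```python
import numpy as np
from scipy import stats

# Simulate response times from a Gamma distribution (shape and scale both 20, as in Fig. 1.6)
rng = np.random.default_rng(1)
rts = rng.gamma(shape=20.0, scale=20.0, size=500)

# Maximum likelihood fit of a two-parameter Gamma model (location fixed at zero)
shape_hat, loc_hat, scale_hat = stats.gamma.fit(rts, floc=0)
print(shape_hat, scale_hat)   # estimates near the generating values of 20 and 20

# The maximized log likelihood, and the corresponding G^2 badness-of-fit value
lnL = np.sum(stats.gamma.logpdf(rts, shape_hat, loc=loc_hat, scale=scale_hat))
print(-2 * lnL)
```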
Now suppose we have two different competing learning models. The model we just described has two parameters. Suppose the competing model is quite different and it is more complex with four parameters. Also suppose the models are not nested, so that it is not possible to compute the same predictions for the simpler model using the more complex model. Then we can compare models by using a Bayesian information criterion (BIC; see Wasserman, 2000, for a review). This criterion is derived on the basis of choosing the model that is most probable given the data. (However, the derivation only holds asymptotically as the sample size increases indefinitely.) For each model we wish to compare, we can compute a BIC index: BIC_model = G²_model + n_model · ln(N), where n_model equals the number of model parameters estimated and N is the number of observations. The BIC index is an index that balances model fit with model complexity as measured by the number of parameters. (Note, however, that model complexity is more than the number of parameters; see Chapter 13.) It is a badness-of-fit index, and so we choose the model with the lowest BIC index. See Chapter 14 for a detailed review of model comparison.
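As a minimal illustration, BIC values can be computed and compared as follows; the G² values and parameter counts below are made-up numbers, not results from fitted models.

```python
import math

def bic(g_squared, n_params, n_observations):
    """BIC = G^2 + (number of parameters) * ln(number of observations)."""
    return g_squared + n_params * math.log(n_observations)

N = 500                                   # observations per participant (illustrative)
bic_simple = bic(g_squared=612.4, n_params=2, n_observations=N)   # hypothetical fit
bic_complex = bic(g_squared=605.1, n_params=4, n_observations=N)  # hypothetical fit

# Lower BIC wins: the complex model must improve fit enough to justify its extra parameters
print(bic_simple, bic_complex)
print("prefer simple" if bic_simple < bic_complex else "prefer complex")
```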
The chapters of this handbook review models of elementary cognitive mechanisms, basic cognitive skills, and higher-level cognition, including Bayesian cognition, decision making, semantic memory, and shape perception. In addition, we provide two chapters on modeling tools, including Bayesian estimation in hierarchical models and model comparison methods. We conclude the handbook with three chapters on new directions in the field, including neurocognitive modeling, mathematical and computational modeling in clinical psychology, and cognitive and decision models based upon quantum probability theory.
The models reviewed in the handbook make use of many of the mathematical ideas presented in this review chapter. Probabilistic models appear in chapters covering signal detection theory (Chapter 2), probabilistic models of cognition (Chapter 9), decision theory (Chapters 10 and 17), and clinical applications (Chapter 16).
Stochastic models (i.e., models that are dynamic and probabilistic) appear in chapters covering information processing (Chapter 4), perceptual judgment (Chapter 6), and random walk/diffusion models of choice and response time in various cognitive tasks (Chapters 3, 7, 10, and 15). Learning and memory models are reviewed in Chapters 5, 7, 8, and 11. Models using vector spaces and geometry are introduced in Chapters 11, 12, and 17.
The basic concepts reviewed in this chapter should be helpful for readers who are new to mathematical and computational models to jumpstart reading the rest of the book. In addition, each chapter is self-contained, presents a tutorial-style introduction to the topic area exemplified by many applications, and provides a specific glossary of the basic concepts in the topic area. We believe you will have a rewarding reading experience.
Note
1. This chapter is restricted to real numbers.

References
Bhattacharya, R. N., & Waymire, E. C. (2009). Stochastic processes with applications (Vol. 61). Philadelphia, PA: SIAM.
Busemeyer, J. R., & Diederich, A. (2010). Cognitive modeling. Thousand Oaks, CA: SAGE.
Cox, D. R., & Miller, H. D. (1965). The theory of stochastic processes (Vol. 134). Boca Raton, FL: CRC Press.
Feller, W. (1968). An introduction to probability theory and its applications (3rd ed., Vol. 1). New York, NY: Wiley.
Hintzman, D. L. (1991). Why are formal models useful in psychology? In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 39–56). Hillsdale, NJ: Erlbaum.
Hogg, R. V., & Craig, A. T. (1970). Introduction to mathematical statistics (3rd ed.). New York, NY: Macmillan.
Lewandowsky, S., & Farrell, S. (2010). Computational modeling in cognition: Principles and practice. Thousand Oaks, CA: SAGE.
Myung, I. J. (2003). Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology, 47, 90–100.
Stewart, J. (2012). Calculus (7th ed.). Belmont, CA: Brooks/Cole.
Suppes, P. (1984). Probabilistic metaphysics. Oxford: Basil Blackwell.
Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.
PART I
Elementary Cognitive Mechanisms
2 Multidimensional Signal Detection Theory
F. Gregory Ashby and Fabian A. Soto

Abstract
Multidimensional signal detection theory is a multivariate extension of signal detection theory that makes two fundamental assumptions, namely that every mental state is noisy and that every action requires a decision. The most widely studied version is known as general recognition theory (GRT). General recognition theory assumes that the percept on each trial can be modeled as a random sample from a multivariate probability distribution defined over the perceptual space. Decision bounds divide this space into regions that are each associated with a response alternative. General recognition theory rigorously defines and tests a number of important perceptual and cognitive conditions, including perceptual and decisional separability and perceptual independence. General recognition theory has been used to analyze data from identification experiments in two ways: (1) fitting and comparing models that make different assumptions about perceptual and decisional processing, and (2) testing assumptions by computing summary statistics and checking whether these satisfy certain conditions. Much has been learned recently about the neural networks that mediate the perceptual and decisional processing modeled by GRT, and this knowledge can be used to improve the design of experiments where a GRT analysis is anticipated.
Key Words: perceptual independence, identification, categorization
Introduction
Signal detection theory revolutionized psychophysics in two different ways. First, it introduced the idea that trial-by-trial variability in sensation can significantly affect a subject's performance. And second, it introduced the field to the then-radical idea that every psychophysical response requires a decision from the subject, even when the task is as simple as detecting a signal in the presence of noise. Of course, signal detection theory proved to be wildly successful, and both of these assumptions are now routinely accepted without question in virtually all areas of psychology.
The mathematical basis of signal detection theory is rooted in statistical decision theory, which itself has a history that dates back at least several centuries. The insight of signal detection theorists was that this model of statistical decisions was also a good model of sensory decisions. The first signal detection theory publication appeared in 1954 (Peterson, Birdsall, & Fox, 1954), but the theory did not really become widely known in psychology until the seminal article of Swets, Tanner, and Birdsall appeared in Psychological Review in 1961. From then until 1986, almost all applications of signal detection theory assumed only one sensory dimension (Tanner, 1956, is the principal exception). In almost all cases, this dimension was meant to represent sensory magnitude. For a detailed description of this standard univariate theory, see the excellent texts of either Macmillan and Creelman (2005) or Wickens (2002). This chapter describes multivariate generalizations of signal detection theory.
Multidimensional signal detection theory is a multivariate extension of signal detection to cases in which there is more than one perceptual dimension. It has all the advantages of univariate signal detection theory (i.e., it separates perceptual and decision processes), but it also offers the best existing method for examining interactions among perceptual dimensions (or components). The most widely studied version of multidimensional signal detection theory is known as general recognition theory (GRT; Ashby & Townsend, 1986). Since its inception, more than 350 articles have applied GRT to a wide variety of phenomena, including categorization (e.g., Ashby & Gott, 1988; Maddox & Ashby, 1993), similarity judgment (Ashby & Perrin, 1988), face perception (Blaha, Silbert, & Townsend, 2011; Thomas, 2001; Wenger & Ingvalson, 2002), recognition and source memory (Banks, 2000; Rotello, Macmillan, & Reeder, 2004), source monitoring (DeCarlo, 2003), attention (Maddox, Ashby, & Waldron, 2002), object recognition (Cohen, 1997; Demeyer, Zaenen, & Wagemans, 2007), perception/action interactions (Amazeen & DaSilva, 2005), auditory and speech perception (Silbert, 2012; Silbert, Townsend, & Lentz, 2009), haptic perception (Giordano et al., 2012; Louw, Kappers, & Koenderink, 2002), and the perception of sexual interest (Farris, Viken, & Treat, 2010).
Extending signal detection theory to multiple dimensions might seem like a straightforward mathematical exercise, but, in fact, several new conceptual problems must be solved. First, with more than one dimension, it becomes necessary to model interactions (or the lack thereof) among those dimensions. During the 1960s and 1970s, a great many terms were coined that attempted to describe perceptual interactions among separate stimulus components. None of these, however, were rigorously defined or had any underlying theoretical foundation. Included in this list were perceptual independence, separability, integrality, performance parity, and sampling independence. Thus, to be useful as a model of perception, any multivariate extension of signal detection theory needed to provide theoretical interpretations of these terms and show rigorously how they were related to one another.
Second, the problem of how to model decision processes when the perceptual space is multidimensional is far more difficult than when there is only one sensory dimension. A standard signal detection theory lecture is to show that almost any decision strategy is mathematically equivalent to setting a criterion on the single sensory dimension, then giving one response if the sensory value falls on one side of this criterion, and the other response if the sensory value falls on the other side. For example, in the normal, equal-variance model, this is true regardless of whether subjects base their decision on sensory magnitude or on likelihood ratio. A straightforward generalization of this model to two perceptual dimensions divides the perceptual plane into two response regions. One response is given if the percept falls in the first region and the other response is given if the percept falls in the second region. The obvious problem is that, unlike a line, there are an infinite number of ways to divide a plane into two regions. How do we know which of these has the most empirical validity?
The solution to the first of these two problems—that is, the sensory problem—was proposed by Ashby and Townsend (1986) in the article that first developed GRT. The GRT model of sensory interactions has been embellished during the past 25 years, but the core concepts introduced by Ashby and Townsend (1986) remain unchanged (i.e., perceptual independence, perceptual separability). In contrast, the decision problem has been much more difficult. Ashby and Townsend (1986) proposed some candidate decision processes, but at that time they were largely without empirical support. In the ensuing 25 years, however, hundreds of studies have attacked this problem, and today much is known about human decision processes in perceptual and cognitive tasks that use multidimensional perceptual stimuli.
Box 1 Notation
AiBj = stimulus constructed by setting component A to level i and component B to level j
aibj = response in an identification experiment signaling that component A is at level i and component B is at level j
X1 = perceived value of component A
X2 = perceived value of component B
fij(x1, x2) = joint likelihood that the perceived value of component A is x1 and the perceived value of component B is x2 on a trial when the presented stimulus is AiBj
gij(x1) = marginal pdf of component A on trials when stimulus AiBj is presented
rij = frequency with which the subject responded Rj on trials when stimulus Si was presented
P(Rj|Si) = probability that response Rj is given on a trial when stimulus Si is presented
General Recognition Theory
General recognition theory (see the Glossary for key concepts related to GRT) can be applied to virtually any task. The most common applications, however, are to tasks in which the stimuli vary on two stimulus components or dimensions. As an example, consider an experiment in which participants are asked to categorize or identify faces that vary across trials on gender and age. Suppose there are four stimuli (i.e., faces) that are created by factorially combining two levels of each dimension. In this case we could denote the two levels of the gender dimension by A1 (male) and A2 (female) and the two levels of the age dimension by B1 (teen) and B2 (adult). Then the four faces are denoted as A1B1 (male teen), A1B2 (male adult), A2B1 (female teen), and A2B2 (female adult).
As with signal detection theory, a fundamental assumption of GRT is that all perceptual systems are inherently noisy. There is noise both in the stimulus (e.g., photon noise) and in the neural systems that determine its sensory representation (Ashby & Lee, 1993). Even so, the perceived value on each sensory dimension will tend to increase as the level of the relevant stimulus component increases. In other words, the distribution of percepts will change when the stimulus changes. So, for example, each time the A1B1 face is presented, its perceived age and maleness will tend to be slightly different.
General recognition theory models the sensory or perceptual effects of a stimulus AiBj via the joint probability density function (pdf) fij(x1, x2) (see Box 1 for a description of the notation used in this article). On any particular trial when stimulus AiBj is presented, GRT assumes that the subject's percept can be modeled as a random sample from this joint pdf. Any such sample defines an ordered pair (x1, x2), the entries of which fix the perceived value of the stimulus on the two sensory dimensions. General recognition theory assumes that the subject uses these values to select a response.
In GRT, the relationship of the joint pdf to the marginal pdfs plays a critical role in determining whether the stimulus dimensions are perceptually integral or separable. The marginal pdf gij(x1) simply describes the likelihoods of all possible sensory values of X1 (it is obtained by integrating the joint pdf over x2). Note that the marginal pdfs are identical to the one-dimensional pdfs of classical signal detection theory.
Component A is perceptually separable from component B if the subject's perception of A does not change when the level of B is varied. For example, age is perceptually separable from gender if the perceived age of the adult in our face experiment is the same for the male adult as for the female adult, and if a similar invariance holds for the perceived age of the teen. More formally, in an experiment with the four stimuli, A1B1, A1B2, A2B1, and A2B2, component A is perceptually separable from B if and only if
g11(x1) = g12(x1) and g21(x1) = g22(x1),    (1)
for all values of x1. Similarly, component B is perceptually separable from A if and only if
g11(x2) = g21(x2) and g12(x2) = g22(x2),    (2)
for all values of x2. If perceptual separability fails, then A and B are said to be perceptually integral. Note that this definition is purely perceptual since it places no constraints on any decision processes.
Another purely perceptual phenomenon is perceptual independence. According to GRT, components A and B are perceived independently in stimulus AiBj if and only if the perceptual value of component A is statistically independent of the perceptual value of component B on AiBj trials. More specifically, A and B are perceived independently in stimulus AiBj if and only if

fij(x1, x2) = gij(x1) gij(x2), (3)

for all values of x1 and x2. If perceptual independence is violated, then components A and B are perceived dependently. Note that perceptual independence is a property of a single stimulus, whereas perceptual separability is a property of groups of stimuli.
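To make Eq 3 concrete, here is a minimal numerical sketch (an illustration, not part of the chapter) that checks perceptual independence for a single assumed bivariate normal perceptual distribution; the mean, variances, and correlation below are arbitrary values chosen for the example.

```python
# Illustrative sketch (not from the chapter): numerically checking perceptual
# independence (Eq. 3) for one assumed bivariate normal perceptual distribution.
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical perceptual distribution for one stimulus AiBj
mu = np.array([0.0, 0.0])
rho = 0.5                       # nonzero correlation -> independence should fail
Sigma = np.array([[1.0, rho], [rho, 1.0]])

x1 = np.linspace(-5, 5, 401)
x2 = np.linspace(-5, 5, 401)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
joint = multivariate_normal(mu, Sigma).pdf(np.dstack([X1, X2]))

# Marginal pdfs g_ij(x1) and g_ij(x2), obtained by integrating out the other dimension
g1 = np.trapz(joint, x2, axis=1)
g2 = np.trapz(joint, x1, axis=0)

# Perceptual independence holds iff f(x1, x2) = g(x1) * g(x2) everywhere (Eq. 3)
product = np.outer(g1, g2)
max_discrepancy = np.max(np.abs(joint - product))
print(f"max |joint - product of marginals| = {max_discrepancy:.4f}")
# With rho = 0 the discrepancy is (numerically) zero; with rho = 0.5 it is not.
```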
inde-A third important construct from GRT is sional separability In our hypothetical experiment
Trang 37deci-with stimuli A1B1, A1B2, A2B1, and A2B2, and
two perceptual dimensions X1 and X2, decisional
separability holds on dimension X1(for example),
if the subject’s decision about whether stimulus
component A is at level 1 or 2 depends only on
the perceived value on dimension X1 A decision
bound is a line or curve that separates regions of the
perceptual space that elicit different responses The
only types of decision bounds that satisfy decisional
separability are vertical and horizontal lines
The Multivariate Normal Model
So far we have made no assumptions about the form of the joint or marginal pdfs. Our only assumption has been that there exists some probability distribution associated with each stimulus and that these distributions are all embedded in some Euclidean space (e.g., with orthogonal dimensions). There have been some efforts to extend GRT to more general geometric spaces (i.e., Riemannian manifolds; Townsend, Aisbett, Assadi, & Busemeyer, 2006; Townsend & Spencer-Smith, 2004), but it is much more common to add more restrictions to the original version of GRT, not fewer. For example, some applications of GRT have been distribution free (e.g., Ashby & Maddox, 1994; Ashby & Townsend, 1986), but most have assumed that the percepts are multivariate normally distributed. The multivariate normal distribution includes two assumptions. First, the marginal distributions are all normal. Second, the only possible dependencies are pairwise linear relationships. Thus, in multivariate normal distributions, uncorrelated random variables are statistically independent.
A hypothetical example of a GRT model that assumes multivariate normal distributions is shown in Figure 2.1. The ellipses shown there are contours of equal likelihood; that is, all points on the same ellipse are equally likely to be sampled from the underlying distribution. The contours of equal likelihood also describe the shape a scatterplot of points would take if they were random samples from the underlying distribution. Geometrically, the contours are created by taking a slice through the distribution parallel to the perceptual plane and looking down at the result from above. Contours of equal likelihood in multivariate normal distributions are always circles or ellipses. Bivariate normal distributions, like those depicted in Figure 2.1, are each characterized by five parameters: a mean on each dimension, a variance on each dimension, and a covariance or correlation between the values on the two dimensions. These are typically catalogued
in a mean vector and a variance-covariance matrix. For example, consider a bivariate normal distribution with joint density function f(x1, x2). Then the mean vector would equal

\[
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad (4)
\]

and the variance-covariance matrix would equal

\[
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\,\sigma_1\sigma_2 \\ \rho\,\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}, \quad (5)
\]

where σ1² and σ2² are the variances on the two dimensions and ρ is the correlation between x1 and x2 (so ρσ1σ2 is their covariance).
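As an illustration of this parameterization (a sketch with invented values, not anything from the chapter), the following code builds a mean vector and variance-covariance matrix for one hypothetical stimulus and draws simulated percepts from the resulting bivariate normal distribution.

```python
# Illustrative sketch: one hypothetical perceptual distribution in a Gaussian GRT
# model, parameterized by two means, two variances, and a correlation.
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([2.0, 3.0])        # mean perceived value on dimensions X1 and X2
sigma1, sigma2, rho = 1.0, 1.5, 0.4
Sigma = np.array([
    [sigma1**2,             rho * sigma1 * sigma2],
    [rho * sigma1 * sigma2, sigma2**2],
])

# Each trial's percept is modeled as one random sample from this distribution
percepts = rng.multivariate_normal(mu, Sigma, size=1000)
print("sample means:", percepts.mean(axis=0))
print("sample correlation:", np.corrcoef(percepts.T)[0, 1])
```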
The multivariate normal distribution has another important property. Consider an identification task with only two stimuli and suppose the perceptual effects associated with the presentation of each stimulus can be modeled as a multivariate normal distribution. Then it is straightforward to show that the decision boundary that maximizes accuracy is always linear or quadratic (e.g., Ashby, 1992). The optimal boundary is linear if the two perceptual distributions have equal variance-covariance matrices (and so the contours of equal likelihood have the same shape and are just translations of each other), and the optimal boundary is quadratic if the two variance-covariance matrices are unequal. Thus, in the Gaussian version of GRT, the only decision bounds that are typically considered are either linear or quadratic.
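A brief sketch of this property, assuming the two stimuli are presented equally often and using invented parameter values: the likelihood-ratio rule below (respond A whenever the percept is more likely under A's distribution) maximizes accuracy, and the implied boundary is linear when the two covariance matrices are equal and quadratic otherwise.

```python
# Illustrative sketch: the accuracy-maximizing bound for two equally likely Gaussian
# stimuli is the set of points where the log-likelihood ratio equals zero.
import numpy as np
from scipy.stats import multivariate_normal

mu_A, mu_B = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma_A = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma_B = np.array([[1.5, 0.0], [0.0, 0.8]])   # unequal covariances -> quadratic bound

f_A = multivariate_normal(mu_A, Sigma_A)
f_B = multivariate_normal(mu_B, Sigma_B)

def optimal_response(x):
    """Respond 'A' if the percept x is more likely under stimulus A's distribution."""
    return "A" if f_A.logpdf(x) > f_B.logpdf(x) else "B"

print(optimal_response(np.array([0.2, 0.1])))   # near mu_A -> "A"
print(optimal_response(np.array([2.1, 0.9])))   # near mu_B -> "B"
```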
In Figure 2.1, note that perceptual independence holds for all stimuli except A2B2. This can be seen in the contours of equal likelihood. Note that the major and minor axes of the ellipses that define the contours of equal likelihood for stimuli A1B1, A1B2, and A2B1 are all parallel to the two perceptual dimensions. Thus, a scatterplot of samples from each of these distributions would be characterized by zero correlation and, therefore, statistical independence (i.e., in the special Gaussian case). However, the major and minor axes of the A2B2 distribution are tilted, reflecting a positive correlation and hence a violation of perceptual independence.
Fig 2.1 Contours of equal likelihood, decision bounds, and marginal perceptual distributions from a hypothetical multivariate normal GRT model that describes the results of an identification experiment with four stimuli that were constructed by factorially combining two levels of two stimulus dimensions.

Next, note in Figure 2.1 that stimulus component A is perceptually separable from stimulus component B, but B is not perceptually separable from A. To see this, note that the marginal distributions for stimulus component A are the same regardless of the level of component B [i.e., g11(x1) = g12(x1) and g21(x1) = g22(x1), for all values of x1].
Thus, the subject's perception of component A does not depend on the level of B and, therefore, stimulus component A is perceptually separable from B. On the other hand, note that the subject's perception of component B does change when the level of component A changes [i.e., g11(x2) ≠ g21(x2) and g12(x2) ≠ g22(x2) for most values of x2]. In particular, when A changes from level 1 to level 2, the subject's mean perceived value of each level of component B increases. Thus, the perception of component B depends on the level of component A, and therefore B is not perceptually separable from A.
Finally, note that decisional separability holds on dimension 1 but not on dimension 2. On dimension 1 the decision bound is vertical. Thus, the subject has adopted the following decision rule: Component A is at level 2 if x1 > Xc1; otherwise component A is at level 1, where Xc1 is the criterion (i.e., the x1 intercept of the vertical decision bound). Thus, the subject's decision about whether component A is at level 1 or 2 does not depend on the perceived value of component B. So component A is decisionally separable from component B. On the other hand, the decision bound on dimension x2 is not horizontal, so the criterion used to judge whether component B is at level 1 or 2 changes with the perceived value of component A (at least for larger perceived values of A). As a result, component B is not decisionally separable from component A.
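As a concrete illustration of these two decision rules (a sketch with made-up criterion and slope values, not taken from Figure 2.1), the following code maps a percept (x1, x2) to one of the four responses using a vertical bound on dimension 1 and a tilted linear bound on dimension 2.

```python
# Illustrative sketch: mapping a percept (x1, x2) to one of the four identification
# responses using a vertical bound on dimension 1 and a tilted bound on dimension 2.
X_C1 = 1.0          # hypothetical criterion on dimension 1 (vertical bound)
B0, B1 = 1.2, 0.3   # hypothetical intercept and slope of the tilted bound on dimension 2

def identify(x1: float, x2: float) -> str:
    a_level = 2 if x1 > X_C1 else 1            # depends only on x1: decisionally separable
    b_level = 2 if x2 > B0 + B1 * x1 else 1    # criterion shifts with x1: not separable
    return f"A{a_level}B{b_level}"

print(identify(0.5, 1.0))   # "A1B1"
print(identify(2.0, 1.6))   # "A2B1", because the bound on x2 is higher when x1 is large
```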
Applying GRT to Data
The most common applications of GRT are to data collected in an identification experiment like the one modeled in Figure 2.1. The key data from such experiments are collected in a confusion matrix, which contains a row for every stimulus and a column for every response (Table 2.1 displays an example of a confusion matrix, which will be discussed and analyzed later). The entry in row i and column j lists the number of trials on which stimulus Si was presented and the subject gave response Rj. Thus, the entries on the main diagonal give the frequencies of all correct responses, and the off-diagonal entries describe the various errors (or confusions). Note that each row sum equals the total number of stimulus presentations of that type. So if each stimulus is presented 100 times, then the sum of all entries in each row will equal 100. This means that there is one constraint per row, so an n × n confusion matrix will have n × (n − 1) degrees of freedom.
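The following sketch shows how such a confusion matrix might be represented and summarized; the counts are invented and are not the entries of Table 2.1.

```python
# Illustrative sketch: a 4 x 4 confusion matrix for stimuli A1B1, A1B2, A2B1, A2B2,
# with rows = presented stimulus and columns = response. Counts are invented.
import numpy as np

confusions = np.array([
    [70, 10, 15,  5],
    [12, 68,  6, 14],
    [14,  4, 72, 10],
    [ 5, 11, 13, 71],
])

row_totals = confusions.sum(axis=1)             # n_i: presentations of each stimulus
p_response = confusions / row_totals[:, None]   # estimated P(R_j | S_i)

n = confusions.shape[0]
print("estimated P(R_j | S_i):\n", np.round(p_response, 2))
print("degrees of freedom:", n * (n - 1))       # one constraint per row
```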
General recognition theory has been used to analyze data from confusion matrices in two different ways. One is to fit the model to the entire confusion matrix. In this method, a GRT model is constructed with specific numerical values of all of its parameters and a predicted confusion matrix is computed. Next, values of each parameter are found that make the predicted matrix as close as possible to the empirical confusion matrix. To test various assumptions about perceptual and decisional processing (for example, whether perceptual independence holds), a version of the model that assumes perceptual independence is fit to the data, as well as a version that makes no assumptions about perceptual independence. This latter version contains the former version as a special case (i.e., in which all covariance parameters are set to zero), so it can never fit worse. After fitting these two models, we assume that perceptual independence is violated if the more general model fits significantly better than the more restricted model that assumes perceptual independence. The other method for using GRT to test assumptions about perceptual processing, which is arguably more popular, is to compute certain summary statistics from the empirical confusion matrix and then to check whether these satisfy certain conditions that are characteristic of perceptual separability or independence. Because these two methods are so different, we will discuss each in turn.
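The nested-model comparison described above is commonly carried out with a likelihood-ratio test. The sketch below is a generic version of that recipe (the log-likelihoods and parameter counts are invented): twice the difference in maximized log-likelihoods is compared to a chi-square distribution with degrees of freedom equal to the number of extra free parameters.

```python
# Illustrative sketch: likelihood-ratio (G^2) test comparing a restricted GRT model
# (perceptual independence: all covariances fixed at zero) to the more general model.
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods and parameter counts from two model fits
loglik_restricted, n_params_restricted = -412.7, 16
loglik_general, n_params_general = -401.3, 20

g_squared = 2.0 * (loglik_general - loglik_restricted)
extra_params = n_params_general - n_params_restricted
p_value = chi2.sf(g_squared, df=extra_params)

print(f"G^2 = {g_squared:.2f}, df = {extra_params}, p = {p_value:.4f}")
# A small p-value suggests the extra covariance parameters improve fit reliably,
# that is, evidence against perceptual independence.
```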
It is important to note, however, that regardless of which method is used, there are certain nonidentifiabilities in the GRT model that could limit the conclusions that can be drawn from any such analyses (e.g., Menneer, Wenger, & Blaha, 2010; Silbert & Thomas, 2013). The problems are most acute with 2 × 2 identification data (i.e., when the stimuli are A1B1, A1B2, A2B1, and A2B2). For example, Silbert and Thomas (2013) showed that in 2 × 2 applications where there are two linear decision bounds that do not satisfy decisional separability, there always exists an alternative model that makes the exact same empirical predictions and satisfies decisional separability (and these two models are related by an affine transformation). Thus, decisional separability is not testable with standard applications of GRT to 2 × 2 identification data (nor can the slopes of the decision bounds be uniquely estimated). For several reasons, however, these nonidentifiabilities are not catastrophic.
First, the problems don't generally exist with 3 × 3 or larger identification tasks. In the 3 × 3 case, the GRT model with linear bounds requires at least 4 decision bounds to divide the perceptual space into 9 response regions (e.g., in a tic-tac-toe configuration). Typically, two will have a generally vertical orientation and two will have a generally horizontal orientation. In this case, there is no affine transformation that guarantees decisional separability, except in the special case where the two vertical-tending bounds are parallel and the two horizontal-tending bounds are parallel (because parallel lines remain parallel after affine transformations). Thus, in 3 × 3 (or higher) designs, decisional separability is typically identifiable and testable.
Second, there are simple experimental manipulations that can be added to a 2 × 2 identification experiment to test for decisional separability. In particular, switching the locations of the response keys is known to interfere with performance if decisional separability fails, but not if decisional separability holds (Maddox, Glass, O'Brien, Filoteo, & Ashby, 2010; for more information on this, see the later section entitled "Neural Implementations of GRT"). Thus, one could add 100 extra trials to the end of a 2 × 2 identification experiment in which the response key locations are randomly interchanged (and participants are informed of this change). If accuracy drops significantly during this period, then decisional separability can be rejected, whereas if accuracy is unaffected, then decisional separability is supported.
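One simple way to evaluate such an accuracy drop (a generic sketch with invented counts; the chapter does not prescribe a particular test) is to compare the correct and incorrect response counts before and after the key switch.

```python
# Illustrative sketch: comparing accuracy before vs. after the response keys are
# switched, using a chi-square test on a 2 x 2 table of correct/incorrect counts.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: [correct, incorrect] in the last 100 pre-switch trials
# and in the 100 post-switch trials.
table = np.array([
    [82, 18],   # before the key switch
    [64, 36],   # after the key switch
])

chi2_stat, p_value, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.4f}")
# A reliable drop in accuracy after the switch argues against decisional separability.
```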
Third, the identifiability problems can be avoided by using the newly developed GRT model with individual differences (GRT-wIND; Soto, Vucovich, Musgrave, & Ashby, in press), which was patterned after the INDSCAL model of multidimensional scaling (Carroll & Chang, 1970). GRT-wIND is fit to the data from all individuals simultaneously. All participants are assumed to share the same group perceptual distributions, but different participants are allowed different linear bounds, and they are assumed to allocate different amounts of attention to each perceptual dimension. The model does not suffer from the identifiability problems identified by Silbert and Thomas (2013), even in the 2 × 2 case, because with different linear bounds for each participant there is no affine transformation that simultaneously makes all these bounds satisfy decisional separability.
Fitting the GRT Model to Identification Data
Computing the Likelihood Function
When the full GRT model is fit to identification data, the best-fitting values of all free parameters must be found. Ideally, this is done via the method of maximum likelihood; that is, numerical values of all parameters are found that maximize the likelihood of the data given the model. Let S1, S2, ..., Sn denote the n stimuli in an identification experiment and let R1, R2, ..., Rn denote the n responses. Let rij denote the frequency with which the subject responded Rj on trials when stimulus Si was presented. Thus, rij is the entry in row i and column j of the confusion matrix. Note that the rij are random variables. The entries in each row have a multinomial distribution. In particular, if P(Rj|Si) is the true probability that response Rj is given on trials when stimulus Si is presented, then the probability of observing the response frequencies ri1, ri2, ..., rin in row i equals

\[
\frac{n_i!}{r_{i1}!\, r_{i2}! \cdots r_{in}!}\; P(R_1|S_i)^{r_{i1}}\, P(R_2|S_i)^{r_{i2}} \cdots P(R_n|S_i)^{r_{in}}, \quad (6)
\]
where ni is the total number of trials on which stimulus Si was presented during the course of the experiment. The probability, or joint likelihood, of observing the entire confusion matrix is the product of the probabilities of observing each row; that is,

\[
L = \left[\prod_{i=1}^{n} \frac{n_i!}{r_{i1}!\, r_{i2}! \cdots r_{in}!}\right] \left[\prod_{i=1}^{n} \prod_{j=1}^{n} P(R_j|S_i)^{r_{ij}}\right]. \quad (7)
\]
General recognition theory models predict that P(Rj|Si) has a specific form. Specifically, they predict that P(Rj|Si) is the volume in the Rj response region under the multivariate distribution of perceptual effects elicited when stimulus Si is presented. This requires computing a multiple integral.
The maximum likelihood estimators of the GRT model parameters are those numerical values of each parameter that maximize L. Note that the first term in Eq 7 does not depend on the values of any model parameters; rather, it depends only on the data. Thus, the parameter values that maximize the second term also maximize the whole expression. For this reason, the first term can be ignored during the maximization process. Another common practice is to take logs of both sides of Eq 7. Parameter values that maximize L will also maximize any monotonic function of L (and log is a monotonic transformation). So, the standard approach is to find values of the free parameters that maximize

\[
\ln L = \sum_{i=1}^{n} \sum_{j=1}^{n} r_{ij} \ln P(R_j|S_i). \quad (8)
\]
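For concreteness, here is a small sketch (not from the chapter) that computes the log-likelihood in Eq 8 for an observed confusion matrix, given a matrix of predicted response probabilities; both arrays below are invented placeholders.

```python
# Illustrative sketch: the log-likelihood in Eq. 8 for an observed confusion matrix,
# given the response probabilities P(R_j | S_i) predicted by some GRT model.
import numpy as np

def grt_log_likelihood(confusions: np.ndarray, predicted_probs: np.ndarray) -> float:
    """Sum of r_ij * ln P(R_j | S_i), ignoring the data-only multinomial constant."""
    return float(np.sum(confusions * np.log(predicted_probs)))

# Invented 2 x 2 example (in practice the matrices are n x n and the predicted
# probabilities come from integrating the model's perceptual distributions).
observed = np.array([[80, 20],
                     [30, 70]])
predicted = np.array([[0.75, 0.25],
                      [0.35, 0.65]])

print(grt_log_likelihood(observed, predicted))
```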
Estimating the Parameters
In the case of the multivariate normal model, the predicted probability P(Rj|Si) in Eq 8 equals the volume under the multivariate normal pdf that describes the subject's perceptual experiences on trials when stimulus Si is presented, over the response region associated with response Rj. To estimate the best-fitting parameter values using a standard minimization routine, such integrals must be evaluated many times. If decisional separability is assumed, then the problem simplifies considerably. For example, under these conditions, Wickens (1992) derived the first and second derivatives necessary to quickly estimate parameters of the model using the Newton-Raphson method. Other methods must be used for more general models that do not assume decisional separability. Ennis and Ashby (2003) proposed an efficient algorithm for evaluating the integrals that arise when fitting any GRT model. This algorithm allows the parameters of virtually any GRT model to be estimated via standard minimization software. The remainder of this section describes this method.
The left side of Figure 2.2 shows a contour of equal likelihood from the bivariate normal distribution that describes the perceptual effects of stimulus Si, and the solid lines denote two possible decision bounds in this hypothetical task. In Figure 2.2 the bounds are linear, but the method works for any number of bounds that have any parametric form. The shaded region is the Rj response region. Thus, according to GRT, computing P(Rj|Si) is equivalent to computing the volume under the Si perceptual distribution in the Rj response region. This volume is indicated by the shaded region in the figure. First note that any linear bound can be written in discriminant function form as

\[
h(x_1, x_2) = b_1 x_1 + b_2 x_2 + c,
\]

where b1 and b2 are slope constants and c is a constant. The discriminant function form of any decision bound has the property that positive values are obtained if any point on one side of the bound is inserted into the function, and negative values are obtained if any point on the opposite side is inserted. So, for example, in Figure 2.2, the constants b1, b2, and c can be selected so that h1(x) > 0 for any point above the h1 bound and h1(x) < 0 for any point below the bound. Similarly, for the h2 bound, the constants can be selected so that h2(x) > 0 for any point to the right of the bound and h2(x) < 0 for any point to the left. Note that under these conditions, the Rj response region is defined as the set of all x such that h1(x) > 0 and h2(x) > 0. If we denote the multivariate normal (mvn) pdf for stimulus Si as mvn(μi, Σi), then

\[
P(R_j|S_i) = \iint_{h_1(x) > 0,\; h_2(x) > 0} \mathrm{mvn}(\mu_i, \Sigma_i)\, dx_2\, dx_1.
\]

Evaluating such integrals directly can be computationally expensive, so the first step is to transform the problem using a multivariate form of the well-known z transformation. Ennis and Ashby proposed using the Cholesky transformation. Any random vector x that has a multivariate normal distribution can always be rewritten as

\[
x = \mu + C z,
\]

where μ is the mean vector of x, z is a random vector with a multivariate z distribution (i.e., a multivariate normal distribution with mean vector 0 and variance-covariance matrix equal to the identity matrix), and C is a lower triangular matrix such that CC′ equals the variance-covariance matrix of x (i.e., C is its Cholesky factor).
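To give a feel for why this transformation is convenient, here is a rough Monte Carlo sketch under invented parameter values. It is only an illustration of the Cholesky idea, not the Ennis and Ashby (2003) algorithm itself: percepts are generated as x = μ + Cz, and P(Rj|Si) is approximated by the fraction of simulated percepts that fall in the Rj region defined by h1(x) > 0 and h2(x) > 0.

```python
# Illustrative sketch: approximating P(R_j | S_i) by simulating percepts via the
# Cholesky transformation x = mu + C z and counting how many land in the R_j region.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical perceptual distribution for stimulus S_i
mu = np.array([1.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
C = np.linalg.cholesky(Sigma)                  # lower triangular, C @ C.T == Sigma

# Hypothetical linear bounds in discriminant-function form h(x) = b1*x1 + b2*x2 + c
def h1(x):  # positive above this bound
    return x[..., 1] - 0.5

def h2(x):  # positive to the right of this bound
    return x[..., 0] - 0.8

z = rng.standard_normal((200_000, 2))          # multivariate z samples
x = mu + z @ C.T                               # percepts with the desired mean and covariance

p_Rj_given_Si = np.mean((h1(x) > 0) & (h2(x) > 0))
print(f"estimated P(R_j | S_i) = {p_Rj_given_Si:.3f}")
```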