Statistics for Social and Behavioral Sciences
Statistics for Social and Behavioral Sciences (SSBS) includes monographs and advanced textbooks relating to education, psychology, sociology, political science, public policy, and law.
More information about this series at http://www.springer.com/series/3463
Russell G. Almond • Robert J. Mislevy
Linda S. Steinberg • Duanli Yan
David M. Williamson
Bayesian Networks in Educational Assessment
ISBN 978-1-4939-2124-9 ISBN 978-1-4939-2125-6 (eBook)
DOI 10.1007/978-1-4939-2125-6
Library of Congress Control Number: 2014958291
Springer New York Heidelberg Dordrecht London
© Springer Science+Business Media New York 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Forward into future times we go
Over boulders standing in our way
Rolling them aside because we know
Others follow in our steps one day
Under deepest earth the gems are found
Reaching skyward ’till we grasp the heights
Climbing up to where the view surrounds
Hidden valleys offer new delights
Inch by inch and yard by yard until
Luck brings us to the hidden vale
Desiring a place to rest yet still
Returning home now to tell the tale
Ever knowing when that day does come
New hands will take up work left undone
Acknowledgements

Bayesian Networks in Educational Assessment (BNinEA) is the direct issue of two projects, and the descendant, cousin, or sibling of many more. We are grateful for all we have learned in these experiences from many collaborators and supporters over the years.

The first direct ancestor is the series of workshops we have presented at the annual meeting of the National Council on Measurement in Education (NCME) almost every year since 2001. We are grateful to NCME for this opportunity and to ETS for support in developing the materials they have granted us permission to use. Workshop participants will recognize many concepts, algorithms, figures, and hands-on exercises from these sessions.

Its second direct ancestor is the Portal project at Educational Testing Service. It was here that Linda, Russell, and Bob fleshed out the evidence-centered design (ECD) assessment framework, implemented it in an object model and design system, and carried out the applications using Bayes nets.
We are grateful to Henry Braun and Drew Gitomer, successive vice-presidents
of Research, and Len Swanson, head of New Product Development, for supporting Portal. Our collaborators included Brian Berenbach, Marjorie Biddle, Lou DiBello, Howie Chernik, Eddie Herskovits, Cara Cahallan Laitusis, Jan Lukas, Alexander Matukhin, and Peggy Redman.

Biomass (Chaps. 14 and 15) was a ground-up demonstration of Portal, ECD design, standards-based science assessment, and web-delivered interactive testing, with automated scoring of inquiry investigations and Bayes net measurement models. The Biomass team included Andy Baird, Frank Jenkins, our subject matter lead Ann Kindfield, and Deniz Senturk. Our subject matter consultants were Scott Kight, Sue Johnson, Gordon Mendenhall, Cathryn Rubin, and Dirk Vanderklein.
The Networking Performance Skill System (NetPASS) project was an online performance-based assessment activity for designing and troubleshooting computer networks. It was developed in collaboration with the Cisco Networking Academy Program, and led by John Behrens of Cisco. NetPASS featured principled design of proficiency, task, and evidence models and a Bayes net psychometric model using the methods described in BNinEA. Team members included Malcolm Bauer, Sarah DeMark, Michael Faron, Bill Frey, Dennis Frezzo, Tara Jennings, Peggy Redman, Perry Reinert, and Ken Stanley. The NetPASS prototype was foundational for the Packet Tracer simulation system and Aspire game environment that Cisco subsequently developed, and millions of Cisco Networking Academy (CNA) students around the world have used operationally to learn beginning network engineering skills.

The DISC scoring engine was a modular Bayes-net-based evidence accumulation package developed for the Dental Interactive Simulations Corporation (DISC), by the Chauncey Group International, ETS, and the DISC Scoring Team: Barry Wohlgemuth, DISC President and Project Director; Lynn Johnson, Project Manager; Gene Kramer; and five core dental hygienist members, Phyllis Beemsterboer, RDH, Cheryl Cameron, RDH, JD, Ann Eshenaur, RDH, Karen Fulton, RDH, and Lynn Ray, RDH. Jay Breyer was the Chauncey Group lead, and was instrumental in conducting the expert–novice studies and constructing proficiency, task, and evidence models.

Adaptive Content with Evidence-based Diagnosis (ACED) was the brainchild of Valerie J. Shute. It had a large number of contributors including Larry Casey, Edith Aurora Graf, Eric Hansen, Waverly Hester, Steve Landau, Peggy Redman, Jody Underwood, and Diego Zapata-Rivera. ACED development and data collection were sponsored by National Science Foundation Grant No. 3013202. The complete ACED models and data are available online; see the Appendix for details.
Bob’s initial forays into applying Bayesian networks in educational assessment were supported in part by grants from the Office of Naval Research (ONR) and from the National Center for Research on Evaluation, Standards, and Student Testing (CRESST) at the University of California at Los Angeles. We are grateful to Charles Davis, Project Officer of ONR’s Model-Based Measurement program, and Eva Baker, Director of CRESST, for their support. Much of the work we draw on here appears in ONR and CRESST research reports. The findings and opinions expressed in BNinEA, however, do not reflect the positions or policies of ONR, the National Institute on Student Achievement, Curriculum, and Assessment, the Office of Educational Research and Improvement, or the U.S. Department of Education.

Some of the ideas in this book are based on Russell’s previous work on the Graphical-Belief project. Thanks to Doug Martin at StatSci for sponsoring that project as well as the NASA Small Business Innovation Research (SBIR) program for supporting initial development. David Madigan made a number of contributions to that work, particularly pointing out the importance of the weight of evidence. Graphical-Belief is based on the earlier work on Belief while Russell was still a student at Harvard. Art Dempster and Augustine Kong both provided valuable advice for that work. The work of Glenn Shafer and David Schum in thinking about the representation of evidence has been very useful as well. Those contributions are documented in Russell’s earlier book.
Along with the NCME training sessions and working on NetPASS and DISC, David completed his doctoral dissertation on model criticism in Bayes nets in assessment at Fordham University under John Walsh, with Russell and Bob as advisors.
Hydrive was an intelligent tutoring system for helping trainees learn to troubleshoot the hydraulics subsystems of the F-15 aircraft. Drew Gitomer was the Principal Investigator and Linda was the Project Manager. The project was supported by Armstrong Laboratories of the US Air Force, under the Project Officer Sherrie Gott. Design approaches developed in Hydrive were extended and formalized in ECD. Bob and Duanli worked with Drew and Linda to create and test an offline Bayes net scoring model for Hydrive.

Russell and Bob used drafts of BNinEA in classes at Florida State University (FSU) and the University of Maryland, respectively. We received much helpful feedback from students to clarify our ideas and sharpen our presentations. Students at Maryland providing editorial and substantive contributions included Younyoung Choi, Roy Levy, Junhui Liu, Michelle Riconscente, and Daisy Wise Rutstein. Students at FSU providing feedback and advice included Mengyao Cui, Yuhua Guo, Yoon Jeon Kim, Xinya Liang, Zhongtian Lin, Sicong Liu, Umit Tokac, Gertrudes Velasquez, Haiyan Wu, and Yan Xia.

Kikumi Tatsuoka has been a visionary pioneer in the field of cognitive assessment, whose research is a foundation upon which our work and that of many others in the assessment and psychometric communities builds. We are grateful for her permission to use her mixed-number subtraction data in Chaps. 6 and 11.
Brent Boerlage, of Norsys Software Corp., has supported the book in a number of ways. First and foremost, he has made the student version of Netica available for free, which has been exceedingly useful in our classes and online training. Second, he has offered general encouragement for the project and offered to add some of our networks to his growing Bayes net library.

Many improvements to a draft of the book resulted from rigorous attention from the ETS review process. We thank Kim Fryer, the manager of editing services in the Research and Development division at ETS, Associate Editors Dan Eignor and Shelby Haberman, and the reviewers of individual chapters: Malcolm Bauer, Jianbin Fu, Aurora Graf, Shelby Haberman, Yue Jia, Feifei Li, Johnny Lin, Ru Lu, Frank Rijmen, Zhan Shu, Sandip Sinharay, Lawrence Smith, Matthias von Davier, and Diego Zapata-Rivera.
We thank ETS for their continuing support for BNinEA and the various projects noted above, as well as support through Bob’s position as Frederic M. Lord Chair in Measurement and Statistics under Senior Vice-President for Research, Ida Lawrence. We thank ETS for permission to use the figures and tables they own and their assistance in securing permission for the rest, through Juana Betancourt, Stella Devries, and Katie Faherty.
We are grateful also to colleagues who have provided support in more general and pervasive ways over the years, including John Mark Agosta, Malcolm Bauer, Betsy Becker, John Behrens, Judy Goldsmith, Geneva Haertel, Sidney Irvine, Kathryn Laskey, Roy Levy, Bob Lissitz, John Mazzeo, Ann Nicholson, Val Shute, and Howard Wainer.
It has taken longer than it probably should have to complete Bayesian Networks in Educational Assessment. For their continuing encouragement and support, we are indebted to our editors at Springer: John Kimmel, who brought us in, and Jon Gurstelle and Hannah Bracken, who led us out.
Using This Book
An early reviewer urged us to think of this book not as a primer in Bayesian networks (there are already several good titles available, referenced in this volume), but to focus instead on the application: the process of building the model. Our early reviewers also thought that a textbook would be more useful than a monograph, so we have steered this volume in that particular way. In particular, we have tried to make the book understandable to any reasonably intelligent graduate student (and several of our quite intelligent graduate students have let us know when we got too obscure), as this should provide the broadest possible audience.

In particular, most chapters include exercises at the end. We have found, through both our classes and the NCME training sessions, that students do not learn from our lectures or writing (no matter how brilliant) but from trying to apply what they heard and read to new problems. We would urge all readers, even those just skimming, to try the exercises. Solutions are available from Springer or from the authors.
Another thing we have found very valuable in using the volume educationally is starting the students early with a Bayesian network tool. Appendix A lists several tools, and gives pointers to more. Even in the early chapters, merely using the software as a drawing tool helps get students thinking about the ideas. Of course, student projects are an important part of any course like this. Many of the Bayes net collections used in the examples are available online; Appendix A provides the details.
We have divided the book into three parts, which reflect different levels of complexity. Part I is concerned with the basics of Bayesian networks, particularly developing the background necessary to understand how to use a Bayesian network to score a single student. It begins with a brief overview of ECD. The approach is key to understanding how to use Bayesian networks as measurement models, as an integral component of assessment design and use from the beginning, rather than simply as a way to analyze data once it is in hand. (To do the latter is to be disappointed—and it is not the fault of Bayes nets!) It ends with Chap. 7, which goes beyond the basics to start to describe how the Bayesian model supports inference more generally. Part II takes up the issue of calibrating the networks using data from students. This is too complex a topic to cover in great depth, but this section explores parameterizations for Bayesian networks, looks at updating models from data and model criticism, and ends with a complete example. Part III expands from the focus on mechanics to embedding the Bayesian network in an assessment system. Two chapters describe the conceptual assessment framework and the four-process delivery architecture of ECD in greater depth, showing the intimate connections among assessment arguments, design structures, and the function of Bayesian networks in inference. Two more chapters are then devoted to the implementation of Biomass, one of the first assessments to be designed from the ground up using ECD.
When we started this project, it was our intention to write a companion volume about evidence-centered assessment design. Given how long this project has taken, that second volume will not appear soon. Chapters 2, 12, and 13 are probably the best we have to offer at the moment. Russell has used them with some success as standalone readings in his assessment design class. Although ECD does not require Bayesian networks, it does involve a lot of Bayesian thinking about evidence. Readers who are primarily interested in ECD may find that reading all of Part I and exploring simple Bayes net examples helps deepen their understanding of ECD, then moving to Chaps. 12 and 13 if they want additional depth, and the Biomass chapters to see the ideas in practice.
Several of our colleagues in the Uncertainty in Artificial Intelligence community (the home of much of the early work on Bayesian networks) have bemoaned the fact that most of the introductory treatises on Bayesian networks fall short in the area of helping the reader translate between a specific application and the language of Bayesian networks. Part of the challenge here is that it is difficult to do this in the absence of a specific application. This book starts to fill that gap. One advantage of the educational application is that it is fairly easy to understand (most people having been subjected to educational assessment at least once in their lives). Although some of the language in the book is specific to the field of education, much of the development in the book comes from the authors’ attempt to translate the language of evidence from law and engineering to educational assessment. We hope that readers from other fields will find ways to translate it to their own work as well.
In an attempt to create a community around this book, we have created a Wiki for evidence-centered assessment design (http://ecd.ralmond.net/ecdwiki/ECD/ECD/). Specific material to support the book, including example networks and data, is available at the same site (http://ecd.ralmond.net/BN/BN). We would like to invite our readers to browse the material there and to contribute (passwords can be obtained from the authors).
Notation

Random Variables
Random variables in formulae are often indicated by capital letters set in italic
type, e.g., X, while a value of the corresponding random variable is indicated
as a lowercase letter, e.g., x.
Vector-valued random variables and constants are set in boldface For
example, X is a vector valued random variable and x is a potential value for X.
Random variables in Bayesian networks with long descriptive names are
usually set in italic type when referenced in the text, e.g., RandomVariable.
If the long name consists of more than one word, capitalization is often used
to indicate word boundaries (so-called CamelCase).
When random variables appear in graphs, they are often preceded by an icon indicating whether they are defined in the proficiency model or the evidence model. Variables preceded by a circle are proficiency variables, while variables preceded by a triangle are defined locally to an evidence model. They are often but not always observable variables.

The states of such random variables are given in typewriter font, e.g., High and Low.
Note that Bayesian statistics does not allow fixed but unknown quantities. For this reason the distinction between variable and parameter in classical statistics is not meaningful. In this book, the term “variable” is used to refer to a quantity specific to a particular individual taking the assessment, and the term “parameter” is used to indicate quantities that are constant across all individuals.
Sets
Sets of states and variables are indicated with curly braces, e.g., {High, Medium, Low}. The symbol x ∈ A is used to indicate that x is an element of A. The elements inside the curly braces are unordered, so {A1, A2} = {A2, A1}.
The use of parentheses indicates that the elements are ordered, so that (A1, A2) ≠ (A2, A1).
The symbols ∪ and ∩ are used for the union and intersection of two sets. If A and B are sets, then A ⊂ B is used to indicate that A is a proper subset of B, while A ⊆ B also allows the possibility that A = B.
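These set and tuple conventions map directly onto Python's built-in types; a minimal sketch (the state names are purely illustrative):

```python
# Curly braces denote unordered sets: {A1, A2} = {A2, A1}.
states_a = {"High", "Medium", "Low"}
states_b = {"Low", "Medium", "High"}
print(states_a == states_b)             # order does not matter for sets

# Parentheses denote ordered tuples: (A1, A2) != (A2, A1).
print(("High", "Low") == ("Low", "High"))

# Membership, union, intersection, and subset tests.
print("High" in states_a)               # x ∈ A
print(states_a | {"VeryHigh"})          # A ∪ B
print(states_a & {"High", "VeryHigh"})  # A ∩ B
print({"High"} < states_a)              # proper subset, A ⊂ B
print(states_a <= states_a)             # A ⊆ B allows A = B
```

Note that Python's `<` and `<=` on sets mirror ⊂ and ⊆ exactly, including the requirement that a proper subset not equal the larger set.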
If A refers to an event, then Ā refers to the complement of the event; that is, the event that A does not occur.
Ordered tuples indicating vector-valued quantities are indicated with parentheses, e.g., (x1, . . . , xk).
Occasionally, the states of a variable have a meaningful order. The symbol ≺ is used to state that one state is lower than the other. Thus Low ≺ High.
The quantifier ∀x is used to indicate “for all possible values of x.” The quantifier ∃x is used to indicate that an element x exists that satisfies the stated condition.
Probability Distributions and Related Functions
The notation P(X) is used to refer to the probability of an event X. It is also used to refer to the probability distribution of a random variable X, with the hope that the distinction will be obvious from context.
To try to avoid confusions with the distributions of the parameters of distributions, the term law is used for a probability distribution over a parameter and the term distribution is used for the distribution over a random variable, although the term distribution is also used generically.
The notation P(X|Y) is used to refer to the probability of an event X given that another event Y has occurred. It is also used for the collection of probability distributions for a random variable X given the possible instantiations of a random variable Y. Again, we hope that this loose use of notation will be clear from context.
If the domain of the random variable is discrete, then the notation p(X) is used for the probability mass function. If the domain of the random variable is continuous, then the notation f(X) is used to refer to the probability density.

The notation E[g(X)] is used for the expectation of the function g(X) with respect to the distribution P(X). When it is necessary to emphasize the distribution, the random variables are placed as a subscript. Thus, E_X[g(X)] is the expectation of g(X) with respect to the distribution P(X), and E_{X|Y}[g(X)] is the expectation with respect to the distribution P(X|Y).
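For a discrete random variable, E[g(X)] is just a probability-weighted sum. A minimal sketch in Python, with made-up state probabilities and an illustrative scoring function g:

```python
# P(X) for a three-state variable (illustrative values only).
p_x = {"Low": 0.3, "Medium": 0.5, "High": 0.2}

# A scoring function g mapping states to numbers.
g = {"Low": 0, "Medium": 1, "High": 2}

# E[g(X)] = sum over x of g(x) * P(X = x)
e_g = sum(g[x] * p for x, p in p_x.items())
print(e_g)   # 0.3*0 + 0.5*1 + 0.2*2 = 0.9

# E_{X|Y}[g(X)] uses the conditional distribution P(X | Y) instead
# (again, hypothetical numbers).
p_x_given_y = {"Low": 0.1, "Medium": 0.4, "High": 0.5}
e_g_given_y = sum(g[x] * p for x, p in p_x_given_y.items())
print(e_g_given_y)   # 0.1*0 + 0.4*1 + 0.5*2 = 1.4
```

The subscript in E_X or E_{X|Y} corresponds to nothing more than which probability table the sum runs over.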
The notation Var(X) is used to refer to the variance of the random variable X. For a vector-valued random variable X, Var(X) is a matrix giving Var(X_k) on the diagonal and the covariance of X_i and X_j in the off-diagonal elements.
If A and B are two events or two random variables, then the notation A ⊥⊥ B or I(A|∅|B) is used to indicate that A is independent of B. The notations A ⊥⊥ B | C and I(A|C|B) indicate that A is independent of B when conditioned on the value of C (or the event C).
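Conditional independence can be checked numerically on a small joint distribution. The sketch below constructs a toy joint P(A, B, C) in which C is a common parent of A and B (all numbers are hypothetical, chosen only for illustration), then verifies A ⊥⊥ B | C while showing that A and B are marginally dependent:

```python
from itertools import product

# Hypothetical conditional probability tables for binary A, B, C.
p_c = {0: 0.4, 1: 0.6}
p_a_given_c = {0: 0.2, 1: 0.7}   # P(A=1 | C=c)
p_b_given_c = {0: 0.5, 1: 0.9}   # P(B=1 | C=c)

def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x."""
    return p if x == 1 else 1 - p

# Joint distribution: P(A,B,C) = P(C) P(A|C) P(B|C), so A ⊥⊥ B | C holds.
joint = {(a, b, c): p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)
         for a, b, c in product((0, 1), repeat=3)}

def prob(pred):
    """Probability of the event described by pred(a, b, c)."""
    return sum(v for k, v in joint.items() if pred(*k))

# Defining property: P(A=1, B=1 | C=c) = P(A=1 | C=c) P(B=1 | C=c).
for c in (0, 1):
    lhs = prob(lambda a, b, cc: a == 1 and b == 1 and cc == c) / p_c[c]
    rhs = p_a_given_c[c] * p_b_given_c[c]
    assert abs(lhs - rhs) < 1e-12

# Marginally, however, A and B are dependent (C is a common cause):
p_ab = prob(lambda a, b, c: a == 1 and b == 1)
p_a = prob(lambda a, b, c: a == 1)
p_b = prob(lambda a, b, c: b == 1)
print(abs(p_ab - p_a * p_b) > 1e-6)   # P(A, B) ≠ P(A) P(B)
```

This is the common-variable-dependence pattern discussed in Sect. 3.3.2: conditioning on C removes the dependence between A and B.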
The notation N(μ, σ²) is used to refer to a normal distribution with mean μ and variance σ²; N⁺(μ, σ²) refers to the same distribution truncated at zero (so the random variable is strictly positive). The notation Beta(a, b) is used to refer to a beta distribution with parameters a and b. The notation Dirichlet(a1, . . . , aK) is used to refer to a K-dimensional Dirichlet distribution with parameters a1, . . . , aK. The notation Gamma(a, b) is used for a gamma distribution with shape parameter a and scale parameter b.
The symbol ∼ is used to indicate that a random variable follows a particular distribution. Thus X ∼ N(0, 1) would indicate that X is a random variable following a normal distribution with mean 0 and variance 1.
Note that Γ (n) = (n − 1)! when n is a positive integer.
The notation B(a, b) is used for the beta function:

B(a, b) = Γ(a)Γ(b) / Γ(a + b).

The binomial coefficient notation, written with n over y in parentheses, is used to indicate the combinatorial function n!/(y!(n − y)!). The extended combinatorial function replaces the factorials with gamma functions, Γ(n + 1)/(Γ(y + 1)Γ(n − y + 1)), so that it is also defined for noninteger arguments.
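These identities can be checked numerically with the Python standard library; a sketch in which `beta_fn` and `ext_comb` are our own helper names (`math.gamma` and `math.comb` supply Γ and the combinatorial function):

```python
from math import gamma, comb, factorial

# Γ(n) = (n − 1)! for positive integers n.
for n in range(1, 8):
    assert gamma(n) == factorial(n - 1)

# Beta function: B(a, b) = Γ(a) Γ(b) / Γ(a + b).
def beta_fn(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

print(beta_fn(2, 3))   # 1/12 ≈ 0.0833...

# Combinatorial function: n! / (y! (n − y)!).
assert comb(5, 2) == factorial(5) // (factorial(2) * factorial(3))

# The gamma form extends the combinatorial function beyond integers.
def ext_comb(n, y):
    return gamma(n + 1) / (gamma(y + 1) * gamma(n - y + 1))

print(ext_comb(5, 2))   # 10.0, matching comb(5, 2)
```

The gamma form of the coefficient is what makes expressions such as the beta-binomial likelihood well defined when its parameters are not integers.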
Usual Use of Letters for Indices
The letter i is usually used to index individuals.
The letter j is usually used to index tasks, with J being the total number of tasks.
The letter k is usually used to index states of a variable, with K being the total number of states. The notation k[X] is an indicator which is 1 when the random variable X takes on the kth possible value, and zero otherwise.
If x = (x1, . . . , xK) is a vector, then x<k refers to the first k − 1 elements of x, (x1, . . . , xk−1), and x>k refers to the last K − k elements, (xk+1, . . . , xK). They refer to the empty set when k = 1 or k = K, respectively. The notation x−k refers to all elements except the kth; that is, (x1, . . . , xk−1, xk+1, . . . , xK).
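With a zero-based Python list standing in for the one-based vector x = (x1, . . . , xK), these conventions correspond to simple slices; a minimal sketch in which the helper names `below`, `above`, and `drop` are our own:

```python
x = ["x1", "x2", "x3", "x4", "x5"]   # K = 5; x[k-1] plays the role of x_k

def below(x, k):
    """x_{<k}: the first k − 1 elements."""
    return x[:k - 1]

def above(x, k):
    """x_{>k}: the last K − k elements."""
    return x[k:]

def drop(x, k):
    """x_{−k}: all elements except the kth."""
    return x[:k - 1] + x[k:]

print(below(x, 3))   # ['x1', 'x2']
print(above(x, 3))   # ['x4', 'x5']
print(drop(x, 3))    # ['x1', 'x2', 'x4', 'x5']
print(below(x, 1), above(x, 5))   # both empty, as the text notes
```

The off-by-one shifts in the slices are exactly the translation between the book's one-based subscripts and Python's zero-based indexing.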
Contents

Part I Building Blocks for Bayesian Networks
1 Introduction 3
1.1 An Example Bayes Network 4
1.2 Cognitively Diagnostic Assessment 7
1.3 Cognitive and Psychometric Science 11
1.4 Ten Reasons for Considering Bayesian Networks 14
1.5 What Is in This Book 16
2 An Introduction to Evidence-Centered Design 19
2.1 Overview 20
2.2 Assessment as Evidentiary Argument 21
2.3 The Process of Design 23
2.4 Basic ECD Structures 26
2.4.1 The Conceptual Assessment Framework 27
2.4.2 Four-Process Architecture for Assessment Delivery 34
2.4.3 Pretesting and Calibration 38
2.5 Conclusion 39
3 Bayesian Probability and Statistics: a Review 41
3.1 Probability: Objective and Subjective 41
3.1.1 Objective Notions of Probability 42
3.1.2 Subjective Notions of Probability 43
3.1.3 Subjective–Objective Probability 45
3.2 Conditional Probability 46
3.3 Independence and Conditional Independence 51
3.3.1 Conditional Independence 53
3.3.2 Common Variable Dependence 54
3.3.3 Competing Explanations 55
3.4 Random Variables 57
3.4.1 The Probability Mass and Density Functions 57
3.4.2 Expectation and Variance 60
3.5 Bayesian Inference 62
3.5.1 Re-expressing Bayes Theorem 63
3.5.2 Bayesian Paradigm 63
3.5.3 Conjugacy 67
3.5.4 Sources for Priors 72
3.5.5 Noninformative Priors 74
3.5.6 Evidence-Centered Design and the Bayesian Paradigm 76

4 Basic Graph Theory and Graphical Models 81
4.1 Basic Graph Theory 82
4.1.1 Simple Undirected Graphs 83
4.1.2 Directed Graphs 83
4.1.3 Paths and Cycles 84
4.2 Factorization of the Joint Distribution 86
4.2.1 Directed Graph Representation 86
4.2.2 Factorization Hypergraphs 88
4.2.3 Undirected Graphical Representation 90
4.3 Separation and Conditional Independence 91
4.3.1 Separation and D-Separation 91
4.3.2 Reading Dependence and Independence from Graphs 93
4.3.3 Gibbs–Markov Equivalence Theorem 94
4.4 Edge Directions and Causality 95
4.5 Other Representations 97
4.5.1 Influence Diagrams 97
4.5.2 Structural Equation Models 99
4.5.3 Other Graphical Models 100
5 Efficient Calculations 105
5.1 Belief Updating with Two Variables 106
5.2 More Efficient Procedures for Chains and Trees 111
5.2.1 Propagation in Chains 112
5.2.2 Propagation in Trees 116
5.2.3 Virtual Evidence 119
5.3 Belief Updating in Multiply Connected Graphs 122
5.3.1 Updating in the Presence of Loops 122
5.3.2 Constructing a Junction Tree 123
5.3.3 Propagating Evidence Through a Junction Tree 134
5.4 Application to Assessment 135
5.4.1 Proficiency and Evidence Model Bayes Net Fragments 137
5.4.2 Junction Trees for Fragments 139
5.4.3 Calculation with Fragments 143
5.5 The Structure of a Test 145
5.5.1 The Q-Matrix for Assessments Using Only Discrete Items 146
5.5.2 The Q-Matrix for a Test Using Multi-observable Tasks 147
5.6 Alternative Computing Algorithms 149
5.6.1 Variants of the Propagation Algorithm 150
5.6.2 Dealing with Unfavorable Topologies 150
6 Some Example Networks 157
6.1 A Discrete IRT Model 158
6.1.1 General Features of the IRT Bayes Net 161
6.1.2 Inferences in the IRT Bayes Net 162
6.2 The “Context” Effect 166
6.3 Compensatory, Conjunctive, and Disjunctive Models 172
6.4 A Binary-Skills Measurement Model 178
6.4.1 The Domain of Mixed Number Subtraction 178
6.4.2 A Bayes Net Model for Mixed-Number Subtraction 180
6.4.3 Inferences from the Mixed-Number Subtraction Bayes Net 184
6.5 Discussion 190
7 Explanation and Test Construction 197
7.1 Simple Explanation Techniques 198
7.1.1 Node Coloring 198
7.1.2 Most Likely Scenario 200
7.2 Weight of Evidence 201
7.2.1 Evidence Balance Sheet 202
7.2.2 Evidence Flow Through the Graph 205
7.3 Activity Selection 209
7.3.1 Value of Information 209
7.3.2 Expected Weight of Evidence 213
7.3.3 Mutual Information 215
7.4 Test Construction 215
7.4.1 Computer Adaptive Testing 216
7.4.2 Critiquing 217
7.4.3 Fixed-Form Tests 220
7.5 Reliability and Assessment Information 224
7.5.1 Accuracy Matrix 225
7.5.2 Consistency Matrix 230
7.5.3 Expected Value Matrix 230
7.5.4 Weight of Evidence as Information 232
Part II Learning and Revising Models from Data

8 Parameters for Bayesian Network Models 241
8.1 Parameterizing a Graphical Model 241
8.2 Hyper-Markov Laws 244
8.3 The Conditional Multinomial—Hyper-Dirichlet Family 246
8.3.1 Beta-Binomial Family 247
8.3.2 Dirichlet-Multinomial Family 248
8.3.3 The Hyper-Dirichlet Law 248
8.4 Noisy-OR and Noisy-AND Models 250
8.4.1 Separable Influence 254
8.5 DiBello’s Effective Theta Distributions 254
8.5.1 Mapping Parent Skills to θ Space 256
8.5.2 Combining Input Skills 257
8.5.3 Samejima’s Graded Response Model 260
8.5.4 Normal Link Function 263
8.6 Eliciting Parameters and Laws 267
8.6.1 Eliciting Conditional Multinomial and Noisy-AND 269
8.6.2 Priors for DiBello’s Effective Theta Distributions 272
8.6.3 Linguistic Priors 273
9 Learning in Models with Fixed Structure 279
9.1 Data, Models, and Plate Notation 279
9.1.1 Plate Notation 280
9.1.2 A Bayesian Framework for a Generic Measurement Model 282
9.1.3 Extension to Covariates 284
9.2 Techniques for Learning with Fixed Structure 287
9.2.1 Bayesian Inference for the General Measurement Model 288
9.2.2 Complete Data Tables 289
9.3 Latent Variables as Missing Data 297
9.4 The EM Algorithm 298
9.5 Markov Chain Monte Carlo Estimation 305
9.5.1 Gibbs Sampling 308
9.5.2 Properties of MCMC Estimation 309
9.5.3 The Metropolis–Hastings Algorithm 312
9.6 MCMC Estimation in Bayes Nets in Assessment 315
9.6.1 Initial Calibration 316
9.6.2 Online Calibration 321
9.7 Caution: MCMC and EM are Dangerous! 324
10 Critiquing and Learning Model Structure 331
10.1 Fit Indices Based on Prediction Accuracy 332
10.2 Posterior Predictive Checks 335
10.3 Graphical Methods 342
10.4 Differential Task Functioning 347
10.5 Model Comparison 350
10.5.1 The DIC Criterion 350
10.5.2 Prediction Criteria 353
10.6 Model Selection 354
10.6.1 Simple Search Strategies 355
10.6.2 Stochastic Search 356
10.6.3 Multiple Models 357
10.6.4 Priors Over Models 357
10.7 Equivalent Models and Causality 358
10.7.1 Edge Orientation 358
10.7.2 Unobserved Variables 358
10.7.3 Why Unsupervised Learning cannot Prove Causality 360
10.8 The “True” Model 362
11 An Illustrative Example 371
11.1 Representing the Cognitive Model 372
11.1.1 Representing the Cognitive Model as a Bayesian Network 372
11.1.2 Representing the Cognitive Model as a Bayesian Network 377
11.1.3 Higher-Level Structure of the Proficiency Model; i.e., p(θ|λ) and p(λ) 379
11.1.4 High-Level Structure of the Evidence Models; i.e., p(π) 381
11.1.5 Putting the Pieces Together 382
11.2 Calibrating the Model with Field Data 382
11.2.1 MCMC Estimation 383
11.2.2 Scoring 389
11.2.3 Online Calibration 392
11.3 Model Checking 397
11.3.1 Observable Characteristic Plots 398
11.3.2 Posterior Predictive Checks 401
11.4 Closing Comments 405
Part III Evidence-Centered Assessment Design

12 The Conceptual Assessment Framework 411
12.1 Phases of the Design Process and Evidentiary Arguments 414
12.1.1 Domain Analysis and Domain Modeling 414
12.1.2 Arguments and Claims 418
12.2 The Student Proficiency Model 424
12.2.1 Proficiency Variables 424
12.2.2 Relationships Among Proficiency Variables 428
12.2.3 Reporting Rules 433
12.3 Task Models 438
12.4 Evidence Models 443
12.4.1 Rules of Evidence (for Evidence Identification) 444
12.4.2 Statistical Models of Evidence (for Evidence Accumulation) 448
12.5 The Assembly Model 453
12.6 The Presentation Model 458
12.7 The Delivery Model 460
12.8 Putting It All Together 461
13 The Evidence Accumulation Process 467
13.1 The Four-Process Architecture 468
13.1.1 A Simple Example of the Four-Process Framework 471
13.2 Producing an Assessment 474
13.2.1 Tasks and Task Model Variables 474
13.2.2 Evidence Rules 478
13.2.3 Evidence Models, Links, and Calibration 486
13.3 Scoring 488
13.3.1 Basic Scoring Protocols 489
13.3.2 Adaptive Testing 493
13.3.3 Technical Considerations 497
13.3.4 Score Reports 500
14 Biomass: An Assessment of Science Standards 507
14.1 Design Goals 507
14.2 Designing Biomass 510
14.2.1 Reconceiving Standards 510
14.2.2 Defining Claims 513
14.2.3 Defining Evidence 514
14.3 The Biomass Conceptual Assessment Framework 515
14.3.1 The Proficiency Model 515
14.3.2 The Assembly Model 519
14.3.3 Task Models 523
14.3.4 Evidence Models 529
14.4 The Assessment Delivery Processes 535
14.4.1 Biomass Architecture 536
14.4.2 The Presentation Process 538
14.4.3 Evidence Identification 540
14.4.4 Evidence Accumulation 541
14.4.5 Activity Selection 543
14.4.6 The Task/Evidence Composite Library 543
14.4.7 Controlling the Flow of Information Among the Processes 544
14.5 Conclusion 545
15 The Biomass Measurement Model 549
15.1 Specifying Prior Distributions 550
15.1.1 Specification of Proficiency Variable Priors 552
15.1.2 Specification of Evidence Model Priors 554
15.1.3 Summary Statistics 560
15.2 Pilot Testing 561
15.2.1 A Convenience Sample 561
15.2.2 Item and other Exploratory Analyses 564
15.3 Updating Based on Pilot Test Data 566
15.3.1 Posterior Distributions 566
15.3.2 Some Observations on Model Fit 575
15.3.3 A Quick Validity Check 577
15.4 Conclusion 579
16 The Future of Bayesian Networks in Educational Assessment 583
16.1 Applications of Bayesian Networks 583
16.2 Extensions to the Basic Bayesian Network Model 586
16.2.1 Object-Oriented Bayes Nets 586
16.2.2 Dynamic Bayesian Networks 588
16.2.3 Assessment-Design Support 592
16.3 Connections with Instruction 593
16.3.1 Ubiquitous Assessment 594
16.4 Evidence-Centered Assessment Design and Validity 596
16.5 What We Still Do Not Know 597
A Bayesian Network Resources 601
A.1 Software 601
A.1.1 Bayesian Network Manipulation 602
A.1.2 Manual Construction of Bayesian Networks 603
A.1.3 Markov Chain Monte Carlo 603
A.2 Sample Bayesian Networks 604
References 607
Author Index 639
List of Figures
1.1 A graph for the Language Testing Example 5
2.1 The principal design objects of the conceptual assessment framework (CAF) 27
2.2 The proficiency model for a single variable, Proficiency Level 28
2.3 The measurement model for a dichotomously scored item 31
2.4 The four principal processes in the assessment cycle 35
3.1 Canonical experiment: balls in an urn 42
3.2 Graph for Feller’s accident proneness example 55
3.3 Unidimensional IRT as a graphical model 56
3.4 Variables θ1 and θ2 are conditionally dependent given X 56
3.5 Examples of discrete and continuous distributions. a Discrete distribution. b Continuous distribution 59
3.6 Likelihood for θ generated by observing 7 successes in 10 trials 65
3.7 A panel of sample beta distributions 68
3.8 A panel of sample gamma distributions 73
4.1 A simple undirected graph 83
4.2 A directed graph 84
4.3 A tree contains no cycles 85
4.4 Examples of cyclic and acyclic directed graphs 85
4.5 Filling-in edges for triangulation. Without the dotted edge, this graph is not triangulated; adding the dotted edge makes the graph triangulated 86
4.6 Directed graph for P(A)P(B)P(C|A, B)P(D|C)P(E|C)P(F|E, D) 87
4.7 Example of a hypergraph (a) and its 2-section (b) 89
4.8 Hypergraph representing P(A)P(B)P(C|A, B)P(D|C)P(E|C)P(F|E, D) 89
4.9 2-section of Fig. 4.8 90
4.10 D-Separation 93
4.11 Directed graph running in the “causal” direction: P(Skill)
5.2 The acyclic digraph and junction tree for a four-variable chain 112
5.3 A junction tree corresponding to a singly-connected graph 119
5.4 A polytree and its corresponding junction tree 119
5.5 A loop in a multiply connected graph 124
5.6 The tree of cliques and junction tree for Figure 5.5 124
5.7 Acyclic digraph for two-skill example (Example 5.5) 125
5.8 Moralized undirected graph for two-skill example (Example 5.5) 129
5.9 Two ways to triangulate a graph with a loop 130
5.10 Cliques for the two-skill example 130
5.11 Junction tree for the two-skill example 132
5.12 Relationship among proficiency model and evidence model Bayes net fragments 138
5.13 Total acyclic digraph for three-task test 141
5.14 Proficiency model fragments for three-task test 141
5.15 Evidence model fragments for three-task test 141
5.16 Moralized proficiency model graph for three-task test 142
5.17 Moralized evidence model fragments for three-task test 142
6.1 Model graph for five item IRT model 160
6.2 The initial probabilities for the IRT model in Netica. The numbers at the bottom of the box for the Theta node represent the expected value and standard deviation of Theta 162
6.3 a Student with Item 2 and 3 correct. b Student with Item 3 and 4 correct 164
6.4 Probabilities conditioned on θ = 1 165
6.5 Five item IRT model with local dependence 168
6.6 a Student with Item 2 and Item 3 correct with context effect. b Student with Item 3 and Item 4 correct with context effect 169
6.7 Three different ways of modeling observable with two parents 173
6.8 Initial probabilities for three distribution types 174
6.9 a Updated probabilities when Observation = Right. b Updated probabilities when Observation = Wrong 175
6.10 a Updated probabilities when P1 = H and Observation = Right. b Updated probabilities when P1 = H and Observation = Wrong 176
6.11 a Updated probabilities when P1 = M and Observation = Right. b Updated probabilities when P1 = M and Observation = Wrong 177
6.12 Proficiency model for Method B for solving mixed number subtraction 181
6.13 Two evidence model fragments for evidence models 3 and 4 182
6.14 Full Bayesian model for Method B for solving mixed number subtraction 183
6.15 Prior (population) probabilities 185
6.16 Mixed number subtraction: a sample student 186
6.17 Mixed number subtraction: posterior after 7 items 188
6.18 Mixed number subtraction: Skill 1 = Yes 189
6.19 Mixed number subtraction: Skill 1 = No 191
6.20 Mixed number subtraction: Skill 1 = Yes, Skill 2 = No 192
6.21 Updated probabilities when P1 = L and Observation = Wrong 194
7.1 Colored graph for an examinee who reads well but has trouble speaking 199
7.2 Evidence balance sheet for θ ≥ 1 (Example 7.2) 203
7.3 Evidence balance sheet for Reading = Novice (Example 7.1) 204
7.4 Evidence balance sheet for Listening = Novice (Example 7.1) 206
7.5 Evidence balance sheet for Speaking = Novice (Example 7.1) 207
7.6 Evidence flows using weight of evidence 208
7.7 Proficiency model for ACED assessment 219
7.8 Subset of ACED proficiency model for exercises 234
8.1 A simple latent class model with two skills 242
8.2 Hypergraph for a simple latent class model with two skills 242
8.3 Second layer of parameters on top of graphical model 243
8.4 Introducing the demographic variable Grade breaks global parameter independence 245
8.5 A conjunctive model, with no “noise” 250
8.6 A conjunctive model with noisy output (DINA) 251
8.7 A conjunctive model with noisy inputs (NIDA) 252
8.8 A noisy conjunctive model, with noisy inputs and outputs 253
8.9 Separable influences 255
8.10 Midpoints of intervals on the normal distribution 256
8.11 Probabilities for Graded Response model 262
8.12 Output translation method 264
9.1 Expanded and plate digraphs for four Bernoulli variables 281
9.2 Plate digraph for hierarchical Beta-Bernoulli model 282
9.3 Plate digraph for hierarchical Rasch model 283
9.4 Graph for generic measurement model 286
9.5 Graph with covariates and no DIF 287
9.6 Graph with DIF 288
9.7 Graph for mixture model 288
9.8 Graph for three-item latent class model 292
9.9 Examples of poor mixing and good mixing 311
9.10 Convergence of three Markov chains 311
9.11 Plot of the Brooks–Gelman–Rubin R vs cycle number 312
9.12 A Metropolis step that will always be accepted 314
9.13 A Metropolis step that might be rejected 314
9.14 Posteriors for selected parameters 321
10.1 Two alternative discrete IRT models with and without context effect. a Conditionally independent IRT model. b Testlet model 339
10.2 Stem-and-leaf plots of posterior predictive probabilities for Q3 values of item pairs 341
10.3 Q3 values for item pairs 341
10.4 Observable characteristic plot 344
10.5 Observable characteristic plot for additional skill 345
10.6 Direct data display for mixed number subtraction test 347
10.7 Two graphs showing the effects of differential task functioning. a Graph with no DTF. b Graph with DTF 348
10.8 Graphs (a), (b), and (c) have identical independence structures, but Graph (d) does not 359
10.9 Four graphs which could have identical distributions on the observed variables A and C. a No hidden cause, b common cause, c intermediate cause, d partial cause 359
10.10 Selection effect produces apparent dependence among observed variables. a No selection effect. b Selection effect 360
10.11 A minimal graph which should not be interpreted as causal 361
10.12 Inclusion of an additional variable changes picture dramatically 362
10.13 Two candidate models for a hypothetical medical licensure assessment. a Model A. b Model B 363
10.14 Observable characteristic plot for Exercise 10.13 368
10.15 Observable characteristic plot for Exercise 10.14 368
11.1 Proficiency model for mixed-number subtraction, method B 373
11.2 Evidence model fragments for evidence models 3 and 4 375
11.3 Mixed number subtraction Bayes net 375
11.4 Plate representation of the parameterized mixed-number subtraction model 382
11.5 Gelman–Rubin potential scale reduction factors for selected parameters 386
11.6 Histories of MCMC chains for selected parameters 387
11.7 Posterior distributions from MCMC chains for selected parameters 389
11.8 Observable characteristic plots for first eight items 399
11.9 Observable characteristic plots for last seven items 400
12.1 The principal design objects of the CAF 413
12.2 Toulmin’s structure for arguments 420
12.3 Partial language testing proficiency model showing a part-of relationship 429
12.4 Proficiency model for language testing using only the four modal skills 430
12.5 Proficiency model for language with additional communicative competence skills 431
12.6 Prototype score report for Language Placement Test 436
12.7 Evidence model for lecture clarification task for use with modal proficiency model 450
13.1 The four-process architecture for assessment delivery 470
13.2 Assessment blueprint for a small test 474
13.3 Tasks generated from CAF in Fig. 13.2 475
13.4 A sample receiver operating characteristic (ROC) curve 480
13.5 Link model generated from Task 1a in Fig. 13.3 487
13.6 Absorbing evidence from Task 1a for a single candidate 490
13.7 ACED proficiency model 495
14.1 A concept map for mechanisms of evolution 511
14.2 Representation of a standards-based domain for assessment design 512
14.3 The Biomass proficiency model 516
14.4 Biomass: a sample score report from the Biomass classroom assessment 520
14.5 Biomass: the introductory screen 523
14.6 Biomass: background for first task 526
14.7 Biomass: first task is to complete a table for allele representation of mode of inheritance 527
14.8 Biomass: second task, population attribute table 528
14.9 Biomass: fourth task, what to do next? 529
14.10 An evidence model Bayes net fragment with seven observables 533
14.11 An evidence model using three observables and a context effect 534
14.12 An evidence model fragment with three conditionally-independent observables 534
14.13 An evidence model fragment using the inhibitor relationship 535
16.1 A basic dynamic Bayesian network 589
16.2 A learning-model dynamic Bayesian network 590
16.3 Instruction as a partially observed Markov decision process 591
16.4 Influence diagram for skill training decision 593
List of Tables
5.1 Updating probabilities in response to learning Y = y1 109
5.2 Numerical example of updating probabilities 110
5.3 Updating probabilities down a chain 117
5.4 Updating probabilities up a chain 118
5.5 Updating with virtual evidence 121
5.6 Probabilities for Example 5.5 127
5.7 Potential tables for the two-skill example 132
5.8 Updating the potential tables for {θA, θB, X2} 136
5.9 Q-Matrix for design leading to saturated model 147
5.10 Q-Matrix for Fig 5.13, one row per task 148
5.11 Q-Matrix for Fig 5.13, one row per observable 148
5.12 Q-Matrix for proposed assessment (Exercise 5.12) 154
6.1 Conditional probabilities of a correct response for the five-item IRT model 160
6.2 Initial marginal probabilities for five items from IRT model 162
6.3 New potentials for Item 3 and Item 4, conditioned on Context 170
6.4 Conditional probabilities for the three distributions 174
6.5 Q-Matrix for the Tatsuoka (1984) mixed number subtraction test 179
6.6 Conditional probability table for Skill 5 184
6.7 Conditional probability table (CPT) for Item 16 184
6.8 Posteriors after two sets of observations 187
6.9 Predictions for various skill patterns, subtraction assessment 190
6.10 Potentials for Exercise 6.17 195
7.1 Accuracy matrices for Reading and Writing based on 1000 simulated students 227
7.2 Expected accuracy matrices based on 1000 simulations 232
7.3 Conditional probabilities for ACED subset proficiency model 234
7.4 Conditional probabilities for evidence models for ACED subset 234
7.5 Alternative conditional probabilities for ACED subset proficiency model 235
7.6 Accuracy matrices for Speaking and Listening based on 1000 simulated students 237
8.1 Effective thetas for a compensatory combination function 258
8.2 Effective thetas for the conjunctive and disjunctive combination functions 259
8.3 Effective thetas for inhibitor combination functions 260
8.4 Conditional probability table for simple graded response model 263
8.5 Compensatory combination function and graded response link function 263
8.6 Conditional probability table with normal link function, correlation = 0.8 266
8.7 Conditional probability table for path coefficients 0.58 and 0.47 267
9.1 Special cases of the generic measurement model 285
9.2 Prior and posterior statistics for beta distribution with r successes in n trials 291
9.3 Response pattern counts with proficiency variable, θ 295
9.4 Response pattern counts collapsing over proficiency variable, θ 295
9.5 E-step probabilities for iteration 1, by response pattern 304
9.6 E-step expected response pattern counts 304
9.7 E-step iteration 1 expectations of sufficient statistics 305
9.8 Trace of EM parameter estimates 306
9.9 MCMC cycle response pattern counts 318
9.10 MCMC parameter draws from intervals of 1000 and summary statistics 320
9.11 Approximating Dirichlet priors from posterior means and standard deviations for π31 322
9.12 Response pattern counts for online calibration 323
9.13 Average parameter values from initial and online calibrations 324
10.1 Actual and predicted outcomes for the hypothetical medical licensure exam 364
10.2 Logarithmic scores for ten student outcome vectors 365
10.3 Deviance values for two ACED models 366
10.4 Observed outcome for two items for Exercise 10.10 367
11.1 Skill requirements for fraction subtraction items 374
11.2 Equivalence classes and evidence models 376
11.3 Summary statistics for binary-skills model 390
11.4 Selected student responses 391
11.5 Prior and posterior probabilities for selected examinees 392
11.6 Summary statistics for binary-skills model, Admin 1 394
11.7 Summary statistics for binary-skills model, Admin 2 395
11.8 Item-fit indices for the mixed-number subtraction test 404
11.9 Person-fit p-values for selected students 404
13.1 Summary of the four processes 469
13.2 Confusion matrix for binary proficiency and observable 479
13.3 Expected accuracy matrix for observable PC3 in Task Exp4.1 483
13.4 MAP accuracy matrix for Task Exp4.1 483
13.5 MAP accuracy matrix for Task Exp6.1 483
13.6 Three-way table of two observables given proficiency variable 484
13.7 Three-way table of two observables given marginal proficiency 485
13.8 Calculation of expected weight of evidence 496
13.9 Data for Exercise 13.4 502
13.10 Ten randomly selected entries from a set of pretest data 502
13.11 Expected accuracy matrix for two experimental tasks 503
13.12 Expected accuracy matrix (normalized) for two multiple-choice tasks 503
13.13 Data for differential task functioning detection problem (Exercise 13.8) 503
13.14 Data for conditional independence test problem (Exercise 13.9) 504
13.15 Calculation of expected weight of evidence after one observation 504
14.1 A hierarchical textual representation of science standards 509
14.2 Potential observations related to scientific investigation 514
14.3 Segments in the Biomass “mice” scenario 522
14.4 Connecting knowledge representations with investigation steps 525
14.5 Rules of evidence for table task 532
15.1 Task and evidence models from the first Biomass segment 551
15.2 Initial conditional distributions for observables 2–7 of Task 1 556
15.3 Initial conditional distributions for observable 1 of Task 2 558
15.4 Initial conditional probability distributions for all three observables of Task 3 559
15.5 Initial conditional distribution for observable 1 of Task 4 560
15.6 Summary statistics of parameter prior distributions 562
15.7 Observed responses 564
15.8 Summary statistics of prior and posterior population parameter distributions 567
15.9 Summary statistics of item parameter distributions 568
15.10 Prior and posterior expected proficiency levels 569
15.11 Revised conditional distributions for observable 3 of Task 1 571
15.12 Revised conditional probability table for observable 4 of Task 1 572
15.13 Revised conditional distributions for observable 1 of Task 2 573
15.14 A set of simulated preposterior predictive responses 576
Part I
Building Blocks for Bayesian Networks
1 Introduction
David Schum’s 1994 book, The Evidential Foundations of Probabilistic Reasoning, changed the way we thought about assessment. Schum, a psychologist cum legal scholar, was writing about evidence in the most familiar meaning of the word, looking at a lawyer’s use of evidence to prove or disprove a proposition to a jury. However, Schum placed that legal definition in the context of the many broader uses of the term “evidence” in other disciplines. Schum notes that scientists and historians, doctors and engineers, auto mechanics, and intelligence analysts all use evidence in their particular fields. From their cross-disciplinary perspectives, philosophers, statisticians, and psychologists have come to recognize basic principles of reasoning from imperfect evidence that cut across these fields.
Mislevy (1994) shows how to apply the idea of evidence to assessment in education. Say, for example, we wish to show that a student, having completed a reading course, is capable of reading, with comprehension, an article from The New York Times. We cannot open the student’s brain and observe directly the level of comprehension, but we can ask the student questions about various aspects of articles she reads. The answers provide evidence of whether or not the student comprehended what she read, and therefore has the claimed skill.
Schum (1994) faced two problems when developing evidential arguments in practical settings: uncertainty and complexity. We face those same problems in educational assessment and have come to adopt the same solutions: probability theory and Bayesian networks.
Schum (1994) surveys a number of techniques for representing imprecise and uncertain states of knowledge. While approaches such as fuzzy sets, belief functions, and inductive probability all offer virtues and insights, Schum gravitates to probability theory as a best answer. Certainly, probability has had the longest history of practical application and hence is the best understood. Although other systems for representing uncertain states of knowledge, such as Dempster–Shafer models (Shafer 1976; Almond 1995), may provide a broader
In complex situations, it can be difficult to calculate the probability of an event, especially if there are many dependencies. The solution is to draw a picture. A graphical model, a graph whose nodes represent the variables and whose edges represent dependencies between them, provides a guide for both constructing and computing with the statistical models. Graphical models in which all the variables are discrete have some particular computational advantages. These are also known as Bayesian networks because of their capacity to represent complex and changing states of information in a Bayesian fashion. Pearl (1988) popularized this approach to represent uncertainty, especially in the artificial intelligence community. Since then, it has seen explosive growth.
This book explores the implications of applying graphical models to educational assessment. This is a powerful technique that supports the use of more complex models in testing, but is also compatible with the models and techniques that have been developing in psychometrics over the last century. There is an immediate benefit of enabling us to build models which are closer to the cognitive theory of the domain we are testing (Pellegrino et al. 2001). Furthermore, this approach can support the kind of complexity necessary for diagnostic testing and complex constructed response or interactive tasks.

This book divides the story of Bayesian networks in educational assessment into three parts. Part I describes the basic properties of a Bayesian network and how they can be used to accumulate evidence about the state of proficiency of a student. Part II describes how Bayesian networks can be constructed, and in particular, how both the parameters and structure can be refined with data. Part III ties the mathematics of the network to the evidence-centered assessment design (ECD) framework for developing assessments and contains an extensive and detailed example. The present chapter briefly explores the question of why Bayesian networks provide an interesting choice of measurement model for educational assessments.
1.1 An Example Bayes Network
Bayesian networks are formally defined in Chap. 4, but a simple example will help illustrate the basic concepts.
Example 1.1 (Language Testing Example) (Mislevy 1995c). Imagine a language assessment which is designed to report on four proficiency variables: Reading, Writing, Speaking, and Listening. This assessment has four types of task: (1) a reading task, (2) a task which requires both writing and reading, (3) a task which requires speaking and either reading or listening, and (4) a listening task. Evaluating the work product (selection, essay, or speech) produces a single observable outcome variable for each task. These are named Outcome R, Outcome RW, Outcome RSL, and Outcome L, respectively.
[Figure: proficiency nodes Reading, Writing, Speaking, and Listening, with edges to the observable nodes Outcome R, Outcome RW, Outcome RSL, and Outcome L.]
Fig. 1.1 A graph for the Language Testing Example. A Bayesian network for the language test example of Mislevy (1995c). Rounded rectangles in the picture represent variables in the model. Arrows (“edges”) represent patterns of dependence and independence among the variables. This graph provides a visual representation of the joint probability distribution over the variables in the picture. Reprinted with permission from Sage Publications
Figure 1.1 shows the graph associated with this example. Following conventions from ECD (cf. Chaps. 2 and 12), the nodes (rounded rectangles in the graph) for the proficiency variables are ornamented with a circle, and the nodes for the evidence variables are ornamented with a triangle. The edges in the graph flow from the proficiency variables to the observable variables for tasks which require those proficiencies. Thus the graph gives us the information about which skills are relevant for which tasks, providing roughly the same information that a Q-Matrix does in many cognitively diagnostic assessment models (Tatsuoka 1983).
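The skills-to-tasks mapping that the graph encodes can be written down directly in code. The sketch below is ours, not the book’s: the dictionary layout and variable names are illustrative, and it simply lists each observable’s parent proficiencies and derives the corresponding Q-Matrix-style incidence table.

```python
# Parent proficiencies for each observable outcome variable in the
# language-test graph of Fig. 1.1 (node names follow the figure).
parents = {
    "Outcome R":   ["Reading"],
    "Outcome RW":  ["Reading", "Writing"],
    "Outcome RSL": ["Reading", "Speaking", "Listening"],
    "Outcome L":   ["Listening"],
}

skills = ["Reading", "Writing", "Speaking", "Listening"]

# Q-Matrix view: one row per observable, one column per skill;
# a 1 marks a skill that is a parent of that observable.
q_matrix = {
    obs: [1 if s in ps else 0 for s in skills]
    for obs, ps in parents.items()
}

for obs, row in q_matrix.items():
    print(f"{obs:12s} {row}")
```

Reading off a row recovers the skill requirements of a task; reading off a column shows every task that bears evidence about a skill.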
The graphs used to visualize Bayesian networks, such as Fig. 1.1, act as a mechanism for visualizing the joint probability distribution over all of the variables in a complex model, in terms of theoretical and empirical relationships among variables. This graphical representation provides a shared working space between subject matter experts, who provide insight into the cognitive processes underlying the assessment, and psychometricians (measurement experts), who are building the mathematical model. In that sense, Bayesian
As the name implies, Bayesian networks are based on Bayesian views of statistics (see Chap. 3 for a review). The key idea is that a probability distribution holds a state of knowledge about an unknown event. As Bayesian networks represent a probability distribution over multiple variables, they represent a state of knowledge about those variables.
Usually, the initial state of a Bayesian network in educational assessment is based on the distribution of proficiency in the target population for an assessment and the relationship between those proficiencies and task outcomes in the population. The probability values could have come from theory, expert opinion, experiential data, or any mixture of the three. Thus, the initial state of the Bayes net in Fig. 1.1 represents what we know about a student who enters the testing center and sits down at a testing station to take the hypothetical language test: the distribution of proficiencies in the students we typically see, and the range of performance we typically see from these students.
As the student performs the assessment tasks, evaluating the work products using the appropriate evidence rules yields values for the observable outcome variables. The values of the appropriate variables in the network are then instantiated, or set to these values, and the probability distributions in the network are updated (by recursive applications of Bayes rule). The updated network now represents our state of knowledge about this student given the evidence we have observed so far. This is a powerful paradigm for the process of assessment, and leads directly to mechanisms for explaining complex assessments and adaptively selecting future observations (Chap. 7). Chapter 13 describes how this can form the basis for an embedded scoring engine in an intelligent tutoring or assessment system.
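For a single proficiency variable and a single observed outcome, the update just described reduces to one application of Bayes rule. The numbers below are invented for illustration (they are ours, not the book’s), but the mechanics are the same ones a Bayes net engine applies recursively across the whole graph.

```python
# Prior over a binary proficiency, e.g. from the target population.
prior = {"High": 0.4, "Low": 0.6}

# P(outcome = Right | proficiency) -- illustrative values only.
p_right = {"High": 0.9, "Low": 0.3}

def update(prior, likelihood):
    """Posterior over proficiency states after one observation (Bayes rule)."""
    unnorm = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnorm.values())        # P(observation)
    return {s: v / total for s, v in unnorm.items()}

posterior = update(prior, p_right)
print(posterior)   # P(High) rises from 0.4 to 2/3 after a correct response
```

Observing a wrong answer would instead be absorbed by calling `update` with the complementary likelihoods, shifting belief toward the Low state.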
The fact that the Bayes net represents a complete Bayesian probability model has another important consequence: such models can be critiqued and refined from data. Complete Bayesian models provide a predictive probability for any observable pattern of data. Given the data pattern, the parameters of the model can be adjusted to improve the fit of the model. Similarly, alternative model structures can be proposed and explored to see if they do a better job of predicting the observed data. Chapters 9–11 explore the problems of calibrating a model to data and learning model structure from data.
Bayesian network models for assessments are especially powerful when used in the context of ECD (Mislevy et al. 2003b). Chapter 2 gives a brief introduction to some of the language used with ECD, while Chap. 12 explains it in more detail. The authors have used ECD to design a number of assessments, and our experience has caused us to come to value Bayesian networks for two reasons. First, they are multivariate models appropriate for cognitively diagnostic assessment (Sect. 1.2). Second, they help assessment designers to explicitly draw the connection between measurement models and the cognitive models that underlie them (Sect. 1.3).
1.2 Cognitively Diagnostic Assessment
Most psychometricians practicing today work with high-stakes tests designed for selection, placement, or licensing decisions. This is no accident. Errors and inefficiencies in such tests can have high costs, both social and monetary, so it is worthwhile to employ someone to ensure that the reliability and validity of the resulting scores are high. However, because of the prominence of selection/placement tests, assumptions based on the selection/placement purpose and the high stakes are often embedded in the justification for particular psychometric models. It is worth examining closely the assumptions which come from this purpose, to tease apart the purposes, the statistics, and the psychology that are commingled in familiar testing practices.
First, it is almost always good for a selection/placement assessment to be unidimensional. The purpose of a college admission officer looking at an assessment is to rank order the candidates so as to be better able to make decisions about who to admit. This rank ordering implies that the admissions officer wants the candidates in a single line. The situation with licensure and certification testing is similar; the concern is whether or not the candidate makes the cut, and little else.
Because of the high stakes, we are concerned with maximizing the validity of the assessment—the degree to which it provides evidence for the claims we would like to make about the candidate. For selection and placement situations, a practically important indicator of validity is the degree to which the test correlates with a measure of success after the selection or placement. Test constructors can increase this correlation by increasing the reliability of the assessment—the precision of the measurement, or, roughly, the degree to which the test is correlated with itself. This can lead them to discard items which are not highly correlated with the main dimension of the test, even if they are of interest for some other reason.
Although high-stakes tests are not necessarily multiple choice, multiple choice items often play a large role in them. This is because multiple choice is particularly cost effective. The rules of evidence—procedures for determining the observable outcome variables—for multiple choice items are particularly easy to describe and efficient to implement. With thoughtful item writing, multiple choice items can test quite advanced skills. Most importantly, they take little of the student’s time to answer. A student can solve 20–30 multiple choice items in the time it would take to answer a complex constructed response task like an essay, thus increasing reliability. While a complex constructed response task may have a lower reliability than 20–30 multiple choice items, it may tap skills (e.g., generative use of language) that are difficult to measure in any other way. Hence the complex constructed response item can increase validity even though it decreases reliability.
However, the biggest constraints on high stakes testing come from security concerns. With high stakes comes incentive to cheat, and the measures to circumvent cheating are costly. These range from proctoring and verifying the identity of all candidates, to creating alternative forms of the test. The last of these produces a voracious appetite for new items as old ones are retired. It also necessitates the process of equating between scores on alternative forms of the test.
Increasingly, the end users of tests want more than just a single score to use for selection or placement. They are looking for a set of scores to help diagnose problems the examinee might be facing. This is an emerging field called cognitively diagnostic assessment (Leighton and Gierl 2007; Rupp et al. 2010). The “cognitive” part of this name indicates that scores are chosen to reflect a cognitive model of how students acquire skills (see Sect. 1.3). The “diagnostic” part of the name reflects a process that seeks to identify and provide a remedy for some problem in a student’s state of proficiency. Such diagnostic scores can be used for a variety of purposes: as an adjunct to a high stakes test to help a candidate prepare, as a guidance tool to help a learner choose an appropriate instructional strategy, or even shaping instruction on the fly in an intelligent tutoring system. Often these purposes carry much lower stakes, and hence less stringent requirements for security.
Nowhere is the interplay between high stakes and diagnostic assessment more apparent than in the No Child Left Behind (NCLB) Act passed by the U.S. Congress in 2002 and the Race to the Top program passed as part of the American Recovery and Reinvestment Act of 2009. The dual purpose of assessments—accountability and diagnosis at some level—remains a part of the U.S. educational landscape. Under these programs, all children are tested to ensure that they are meeting the state standards. Schools must be making adequate progress toward bringing all students up to the standards. This, in turn, means that educators are very interested in why students are not yet meeting the standards and what they can do to close the gap. They need diagnostic assessment to supplement the required accountability tests to help them identify problems and choose remedies.
When we switch the purpose from selection to diagnosis, everything changes. First and foremost, a multidimensional concept of proficiency usually underlies cognitively diagnostic scoring. (A metaphor: Whereas for a selection exam we might have been content with knowing the volume of the examinee, in a diagnostic assessment we want to distinguish examinees who are tall but narrow and shallow from those who are short, wide, and shallow and those who are short, narrow, and deep.) As a consequence, the single score becomes a multidimensional profile of student proficiency. Lou DiBello (personal communication) referred to such tests as profile score assessments.
The most important and difficult part of building a multidimensional model of proficiency is identifying the right variables. The variables (or suitable summaries) must be able to produce scores that the end users of the assessment care about: scores which relate to claims we wish to make about the student and educational decisions that must be made. That is, it is not enough that the claims concern what students know and can do; they must be organized in ways that help teachers improve what they know and can do. A highly reliable and robust test built around the wrong variables will not be useful to end users and consequently will fall out of use.
Another key difference between a single score test and a profile score test is that we must specify how each task outcome depends on the proficiency variables. In a profile score assessment, for each task outcome, we must answer the questions “What proficiencies are required?”; “How are they related in these requirements?”; and “To what degree are they involved?” This is the key to making the various proficiency variables identifiable. In a single score assessment, each item outcome loads only onto the main variable; the only question is with what strength. Consequently, assessment procedures that are tuned to work with single score assessments will not provide all of the information necessary to build a profile score test.
Suppes (1969) introduced a compact representation of the relationship between proficiency and outcome variables for a diagnostic test called the Q-Matrix. In the Q-Matrix, columns represent proficiency variables and rows represent items (observable outcome variables). A one is placed in the cells where the proficiency is required for the item, and a zero is placed in the other cells. Note that an alternative way to represent graphs is through a matrix with ones where an edge is present and zeros where there is no edge. Thus, there is a close connection between Bayesian network models and other diagnostic models which use the Q-Matrix (Tatsuoka 1983; Junker and Sijtsma 2001; Roussos et al. 2007b).
The situation can become even more complicated if the assessment includes complex constructed response tasks. In this case, several aspects of a student’s work can provide evidence of different proficiencies. Consequently, a task may have multiple observable outcomes. For example, a rater could score an essay on how well the candidate observed the rules of grammar and usage, how well the candidate addressed the topic, and how well the candidate structured the argument. These three outcomes would each draw upon different subsets of the collection of proficiencies measured by the assessment.
Some of the hardest work in assessment with complex constructed response tasks goes into defining the scored outcome variables. Don Melnick, who for
several years led the National Board of Medical Examiners (NBME) project
on computer-based case management problems, observed: “The NBME has consistently found the challenges in the development of innovative testing methods to lie primarily in the scoring arena. Complex test stimuli result in complex responses which require complex models to capture and appropriately combine information from the test to create a valid score” (Melnick 1996, p. 117).
The best way to do this is to design forward. We do not want to wait for a designer to create marvelous tasks, collect whatever data result, and throw it over the wall for the psychometrician to figure out “how to score it.” The most robust conclusion from the cognitive diagnosis literature is this: Diagnostic statistical modeling is far more effective when applied in conjunction with task design from a cognitive framework that motivates both task construction and model structure, than when applied retrospectively to existing assessments (Leighton and Gierl 2007).
Rather, we start by asking what we can observe that will provide evidence that the examinee has the skill we are looking for. We build situations with features that draw on those skills, and call for the examinee to say, do, or make something that provides evidence about them: work products. We call the key features of this work observable outcome variables, and the rules for computing them, rules of evidence. For example, in a familiar essay test the observable outcomes are the one or more scores assigned by a rater, and the rules of evidence are the rubrics the rater uses to evaluate the essay as to its qualities.
A richer example is HYDRIVE (Gitomer et al. 1995), an intelligent tutoring system built for the US Air Force and designed to teach troubleshooting for the hydraulics systems of the F-15 aircraft. An expert/novice study of hydraulics mechanics revealed that experts drew on a number of troubleshooting strategies that they could bring to bear on problems (Steinberg and Gitomer 1996). For example, they might employ a test to determine whether the problem was in the beginning or end of a series of components that all had to work for a flap to move when a lever was pulled. This strategy is called “space splitting” because it splits the problem space into two parts (Newell and Simon 1972). HYDRIVE was designed to capture information not only about whether or not the mechanic correctly identified and repaired the problem, but also about the degree to which the mechanic employed efficient strategies to solve the problem. Both of these were important observable outcomes.
However, when there are multiple aspects of proficiency and tasks can have multiple outcomes, the problem of determining the relationships between proficiencies and observable variables becomes even harder. In HYDRIVE, both knowledge of general troubleshooting strategies and knowledge of the specific system being repaired were necessary to solve most problems. Thus each task entailed a many-to-many mapping between observable outcomes and proficiency variables.
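Such a many-to-many structure can be sketched as a simple mapping from (task, outcome) pairs to the set of proficiencies that inform them. Every task, outcome, and proficiency name below is hypothetical, loosely inspired by the HYDRIVE description above, not taken from the actual system:

```python
# Hypothetical many-to-many mapping: each observable outcome of each
# task depends on one or more proficiency variables.
observable_parents = {
    ("Task1", "problem_repaired"):    {"SystemKnowledge", "StrategicKnowledge"},
    ("Task1", "used_space_split"):    {"StrategicKnowledge"},
    ("Task2", "problem_repaired"):    {"SystemKnowledge", "StrategicKnowledge"},
    ("Task2", "efficient_sequence"):  {"SystemKnowledge", "StrategicKnowledge"},
}

# Invert the mapping to see which observables carry evidence about
# each proficiency variable.
evidence_about: dict[str, set] = {}
for outcome, parents in observable_parents.items():
    for proficiency in parents:
        evidence_about.setdefault(proficiency, set()).add(outcome)
```

In this sketch most observables have several proficiency parents and each proficiency is informed by several observables; it is exactly this many-to-many pattern, rather than the one-column loading of a single score test, that makes identifying the proficiency variables hard.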