Statistics for Social and Behavioral Sciences
Statistics for Social and Behavioral Sciences (SSBS) includes monographs and advanced textbooks relating to education, psychology, sociology, political science, public policy, and law.
More information about this series at http://www.springer.com/series/3463
Russell G. Almond • Robert J. Mislevy
Linda S. Steinberg • Duanli Yan
David M. Williamson
Bayesian Networks in Educational Assessment
ISBN 978-1-4939-2124-9 ISBN 978-1-4939-2125-6 (eBook)
DOI 10.1007/978-1-4939-2125-6
Library of Congress Control Number: 2014958291
Springer New York Heidelberg Dordrecht London
© Springer Science+Business Media New York 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Forward into future times we go
Over boulders standing in our way
Rolling them aside because we know
Others follow in our steps one day
Under deepest earth the gems are found
Reaching skyward ’till we grasp the heights
Climbing up to where the view surrounds
Hidden valleys offer new delights
Inch by inch and yard by yard until
Luck brings us to the hidden vale
Desiring a place to rest yet still
Returning home now to tell the tale
Ever knowing when that day does come
New hands will take up work left undone
Acknowledgements

Bayesian Networks in Educational Assessment (BNinEA) is the direct issue of two projects, and the descendant, cousin, or sibling of many more. We are grateful for all we have learned in these experiences from many collaborators and supporters over the years.

The first direct ancestor is the series of workshops we have presented at the annual meeting of the National Council on Measurement in Education (NCME) almost every year since 2001. We are grateful to NCME for this opportunity and to ETS for support in developing the materials they have granted us permission to use. Workshop participants will recognize many concepts, algorithms, figures, and hands-on exercises from these sessions.

Its second direct ancestor is the Portal project at Educational Testing Service. It was here that Linda, Russell, and Bob fleshed out the evidence-centered design (ECD) assessment framework, implemented it in an object model and design system, and carried out the applications using Bayes nets.
We are grateful to Henry Braun and Drew Gitomer, successive vice-presidents
of Research, and Len Swanson, head of New Product Development, for supporting Portal. Our collaborators included Brian Berenbach, Marjorie Biddle, Lou DiBello, Howie Chernik, Eddie Herskovits, Cara Cahallan Laitusis, Jan Lukas, Alexander Matukhin, and Peggy Redman.

Biomass (Chaps. 14 and 15) was a ground-up demonstration of Portal, ECD design, standards-based science assessment, and web-delivered interactive testing, with automated scoring of inquiry investigations and Bayes net measurement models. The Biomass team included Andy Baird, Frank Jenkins, our subject matter lead Ann Kindfield, and Deniz Senturk. Our subject matter consultants were Scott Kight, Sue Johnson, Gordon Mendenhall, Cathryn Rubin, and Dirk Vanderklein.
The Networking Performance Skill System (NetPASS) project was an online performance-based assessment activity for designing and troubleshooting computer networks. It was developed in collaboration with the Cisco Networking Academy Program, and led by John Behrens of Cisco. NetPASS featured principled design of proficiency, task, and evidence models and a Bayes net psychometric model using the methods described in BNinEA. Team members included Malcolm Bauer, Sarah DeMark, Michael Faron, Bill Frey, Dennis Frezzo, Tara Jennings, Peggy Redman, Perry Reinert, and Ken Stanley. The NetPASS prototype was foundational for the Packet Tracer simulation system and Aspire game environment that Cisco subsequently developed, and millions of Cisco Networking Academy (CNA) students around the world have used operationally to learn beginning network engineering skills.

The DISC scoring engine was a modular Bayes-net-based evidence accumulation package developed for the Dental Interactive Simulations Corporation (DISC), by the Chauncey Group International, ETS, and the DISC Scoring Team: Barry Wohlgemuth, DISC President and Project Director; Lynn Johnson, Project Manager; Gene Kramer; and five core dental hygienist members, Phyllis Beemsterboer, RDH, Cheryl Cameron, RDH, JD, Ann Eshenaur, RDH, Karen Fulton, RDH, and Lynn Ray, RDH. Jay Breyer was the Chauncey Group lead, and was instrumental in conducting the expert–novice studies and constructing proficiency, task, and evidence models.

Adaptive Content with Evidence-based Diagnosis (ACED) was the brainchild of Valerie J. Shute. It had a large number of contributors including Larry Casey, Edith Aurora Graf, Eric Hansen, Waverly Hester, Steve Landau, Peggy Redman, Jody Underwood, and Diego Zapata-Rivera. ACED development and data collection were sponsored by National Science Foundation Grant No. 3013202. The complete ACED models and data are available online; see the Appendix for details.
Bob’s initial forays into applying Bayesian networks in educational assessment were supported in part by grants from the Office of Naval Research (ONR) and from the National Center for Research on Evaluation, Standards, and Student Testing (CRESST) at the University of California at Los Angeles. We are grateful to Charles Davis, Project Officer of ONR’s Model-Based Measurement program, and Eva Baker, Director of CRESST, for their support. Much of the work we draw on here appears in ONR and CRESST research reports. The findings and opinions expressed in BNinEA, however, do not reflect the positions or policies of ONR, the National Institute on Student Achievement, Curriculum, and Assessment, the Office of Educational Research and Improvement, or the U.S. Department of Education.

Some of the ideas in this book are based on Russell’s previous work on the Graphical-Belief project. Thanks to Doug Martin at StatSci for sponsoring that project as well as the NASA Small Business Innovation Research (SBIR) program for supporting initial development. David Madigan made a number of contributions to that work, particularly pointing out the importance of the weight of evidence. Graphical-Belief is based on the earlier work on Belief while Russell was still a student at Harvard. Art Dempster and Augustine Kong both provided valuable advice for that work. The work of Glenn Shafer and David Schum in thinking about the representation of evidence has been very useful as well. Those contributions are documented in Russell’s earlier book.
Along with the NCME training sessions and working on NetPASS and DISC, David completed his doctoral dissertation on model criticism in Bayes nets in assessment at Fordham University under John Walsh, with Russell and Bob as advisors.
Hydrive was an intelligent tutoring system for helping trainees learn to troubleshoot the hydraulics subsystems of the F-15 aircraft. Drew Gitomer was the Principal Investigator and Linda was the Project Manager. The project was supported by Armstrong Laboratories of the US Air Force, under the Project Officer Sherrie Gott. Design approaches developed in Hydrive were extended and formalized in ECD. Bob and Duanli worked with Drew and Linda to create and test an offline Bayes net scoring model for Hydrive.

Russell and Bob used drafts of BNinEA in classes at Florida State University (FSU) and the University of Maryland, respectively. We received much helpful feedback from students to clarify our ideas and sharpen our presentations. Students at Maryland providing editorial and substantive contributions included Younyoung Choi, Roy Levy, Junhui Liu, Michelle Riconscente, and Daisy Wise Rutstein. Students at FSU providing feedback and advice included Mengyao Cui, Yuhua Guo, Yoon Jeon Kim, Xinya Liang, Zhongtian Lin, Sicong Liu, Umit Tokac, Gertrudes Velasquez, Haiyan Wu, and Yan Xia.

Kikumi Tatsuoka has been a visionary pioneer in the field of cognitive assessment, whose research is a foundation upon which our work and that of many others in the assessment and psychometric communities builds. We are grateful for her permission to use her mixed-number subtraction data in Chaps. 6 and 11.
Brent Boerlage, of Norsys Software Corp., has supported the book in a number of ways. First and foremost, he has made the student version of Netica available for free, which has been exceedingly useful in our classes and online training. Second, he has offered general encouragement for the project and offered to add some of our networks to his growing Bayes net library.

Many improvements to a draft of the book resulted from rigorous attention from the ETS review process. We thank Kim Fryer, the manager of editing services in the Research and Development division at ETS, Associate Editors Dan Eignor and Shelby Haberman, and the reviewers of individual chapters: Malcolm Bauer, Jianbin Fu, Aurora Graf, Shelby Haberman, Yue Jia, Feifei Li, Johnny Lin, Ru Lu, Frank Rijmen, Zhan Shu, Sandip Sinharay, Lawrence Smith, Matthias von Davier, and Diego Zapata-Rivera.
We thank ETS for their continuing support for BNinEA and the various projects noted above, as well as support through Bob’s position as Frederic M. Lord Chair in Measurement and Statistics under Senior Vice-President for Research, Ida Lawrence. We thank ETS for permission to use the figures and tables they own and their assistance in securing permission for the rest, through Juana Betancourt, Stella Devries, and Katie Faherty.
We are grateful also to colleagues who have provided support in more general and pervasive ways over the years, including John Mark Agosta, Malcolm Bauer, Betsy Becker, John Behrens, Judy Goldsmith, Geneva Haertel, Sidney Irvine, Kathryn Laskey, Roy Levy, Bob Lissitz, John Mazzeo, Ann Nicholson, Val Shute, and Howard Wainer.
It has taken longer than it probably should have to complete Bayesian Networks in Educational Assessment. For their continuing encouragement and support, we are indebted to our editors at Springer: John Kimmel, who brought us in, and Jon Gurstelle and Hannah Bracken, who led us out.
Using This Book
An early reviewer urged us to think of this book not as a primer in Bayesian networks (there are already several good titles available, referenced in this volume), but to focus instead on the application: the process of building the model. Our early reviewers also thought that a textbook would be more useful than a monograph, so we have steered this volume in that particular way. In particular, we have tried to make the book understandable to any reasonably intelligent graduate student (and several of our quite intelligent graduate students have let us know when we got too obscure), as this should provide the broadest possible audience.

In particular, most chapters include exercises at the end. We have found, through both our classes and the NCME training sessions, that students do not learn from our lectures or writing (no matter how brilliant) but from trying to apply what they heard and read to new problems. We would urge all readers, even those just skimming, to try the exercises. Solutions are available from Springer or from the authors.
Another thing we have found very valuable in using the volume educationally is starting the students early with a Bayesian network tool. Appendix A lists several tools, and gives pointers to more. Even in the early chapters, merely using the software as a drawing tool helps get students thinking about the ideas. Of course, student projects are an important part of any course like this. Many of the Bayes net collections used in the examples are available online; Appendix A provides the details.
We have divided the book into three parts, which reflect different levels of complexity. Part I is concerned with the basics of Bayesian networks, particularly developing the background necessary to understand how to use a Bayesian network to score a single student. It begins with a brief overview of ECD. The approach is key to understanding how to use Bayesian networks as measurement models, as an integral component of assessment design and use from the beginning, rather than simply as a way to analyze data once it is in hand. (To do the latter is to be disappointed—and it is not the fault of Bayes nets!) It ends with Chap. 7, which goes beyond the basics to start to describe how the Bayesian model supports inference more generally. Part II takes up the issue of calibrating the networks using data from students. This is too complex a topic to cover in great depth, but this section explores parameterizations for Bayesian networks, looks at updating models from data and model criticism, and ends with a complete example. Part III expands from the focus on mechanics to embedding the Bayesian network in an assessment system. Two chapters describe the conceptual assessment framework and the four-process delivery architecture of ECD in greater depth, showing the intimate connections among assessment arguments, design structures, and the function of Bayesian networks in inference. Two more chapters are then devoted to the implementation of Biomass, one of the first assessments to be designed from the ground up using ECD.
When we started this project, it was our intention to write a companion volume about evidence-centered assessment design. Given how long this project has taken, that second volume will not appear soon. Chapters 2, 12, and 13 are probably the best we have to offer at the moment. Russell has used them with some success as standalone readings in his assessment design class. Although ECD does not require Bayesian networks, it does involve a lot of Bayesian thinking about evidence. Readers who are primarily interested in ECD may find that reading all of Part I and exploring simple Bayes net examples helps deepen their understanding of ECD, then moving to Chaps. 12 and 13 if they want additional depth, and the Biomass chapters to see the ideas in practice.
Several of our colleagues in the Uncertainty in Artificial Intelligence community (the home of much of the early work on Bayesian networks) have bemoaned the fact that most of the introductory treatises on Bayesian networks fall short in the area of helping the reader translate between a specific application and the language of Bayesian networks. Part of the challenge here is that it is difficult to do this in the absence of a specific application. This book starts to fill that gap. One advantage of the educational application is that it is fairly easy to understand (most people having been subjected to educational assessment at least once in their lives). Although some of the language in the book is specific to the field of education, much of the development in the book comes from the authors’ attempt to translate the language of evidence from law and engineering to educational assessment. We hope that readers from other fields will find ways to translate it to their own work as well.
In an attempt to create a community around this book, we have created a Wiki for evidence-centered assessment design (http://ecd.ralmond.net/ecdwiki/ECD/ECD/). Specific material to support the book, including example networks and data, is available at the same site (http://ecd.ralmond.net/BN/BN). We would like to invite our readers to browse the material there and to contribute (passwords can be obtained from the authors).
Notation

Random Variables
Random variables in formulae are often indicated by capital letters set in italic
type, e.g., X, while a value of the corresponding random variable is indicated
as a lowercase letter, e.g., x.
Vector-valued random variables and constants are set in boldface For
example, X is a vector valued random variable and x is a potential value for X.
Random variables in Bayesian networks with long descriptive names are
usually set in italic type when referenced in the text, e.g., RandomVariable.
If the long name consists of more than one word, capitalization is often used
to indicate word boundaries (so-called CamelCase).
When random variables appear in graphs, they are often preceded by an icon indicating whether they are defined in the proficiency model or the evidence model. Variables preceded by a circle are proficiency variables, while variables preceded by a triangle are defined locally to an evidence model. They are often but not always observable variables.

The states of such random variables are given in typewriter font, e.g., High and Low.
Note that Bayesian statistics does not allow fixed but unknown quantities. For this reason the distinction between variable and parameter in classical statistics is not meaningful. In this book, the term “variable” is used to refer to a quantity specific to a particular individual taking the assessment, and the term “parameter” is used to indicate quantities that are constant across all individuals.
Sets
Sets of states and variables are indicated with curly braces, e.g., {High, Medium, Low}. The symbol x ∈ A is used to indicate that x is an element of A. The elements inside the curly braces are unordered, so {A1, A2} = {A2, A1}.
The use of parentheses indicates that the elements are ordered, so that (A1, A2) ≠ (A2, A1).
The symbols ∪ and ∩ are used for the union and intersection of two sets. If A and B are sets, then A ⊂ B is used to indicate that A is a proper subset of B, while A ⊆ B also allows the possibility that A = B.
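These set and tuple conventions map directly onto Python's built-in types; a minimal sketch (the state names are purely illustrative):

```python
# Curly braces denote unordered sets: {A1, A2} = {A2, A1}.
states_a = {"High", "Medium", "Low"}
states_b = {"Low", "Medium", "High"}
print(states_a == states_b)             # order does not matter for sets

# Parentheses denote ordered tuples: (A1, A2) != (A2, A1).
print(("High", "Low") == ("Low", "High"))

# Membership, union, intersection, and subset tests.
print("High" in states_a)               # x ∈ A
print(states_a | {"VeryHigh"})          # A ∪ B
print(states_a & {"High", "VeryHigh"})  # A ∩ B
print({"High"} < states_a)              # proper subset, A ⊂ B
print(states_a <= states_a)             # A ⊆ B allows A = B
```

Note that Python's `<` and `<=` on sets mirror ⊂ and ⊆ exactly, including the requirement that a proper subset not equal the larger set.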
If A refers to an event, then Ā refers to the complement of the event; that is, the event that A does not occur.
Ordered tuples indicating vector-valued quantities are indicated with parentheses, e.g., (x1, . . . , xk).
Occasionally, the states of a variable have a meaningful order. The symbol ≺ is used to state that one state is lower than the other. Thus Low ≺ High.
The quantifier ∀x is used to indicate “for all possible values of x.” The quantifier ∃x is used to indicate that an element x exists that satisfies the stated condition.
Probability Distributions and Related Functions
The notation P(X) is used to refer to the probability of an event X. It is also used to refer to the probability distribution of a random variable X, with the hope that the distinction will be obvious from context.
To try to avoid confusions with the distributions of the parameters of distributions, the term law is used for a probability distribution over a parameter and the term distribution is used for the distribution over a random variable, although the term distribution is also used generically.
The notation P(X|Y) is used to refer to the probability of an event X given that another event Y has occurred. It is also used for the collection of probability distributions for a random variable X given the possible instantiations of a random variable Y. Again, we hope that this loose use of notation will be clear from context.
If the domain of the random variable is discrete, then the notation p(X) is used for the probability mass function. If the domain of the random variable is continuous, then the notation f(X) is used to refer to the probability density.

The notation E[g(X)] is used for the expectation of the function g(X) with respect to the distribution P(X). When it is necessary to emphasize the distribution, the random variables are placed as a subscript. Thus, E_X[g(X)] is the expectation of g(X) with respect to the distribution P(X), and E_{X|Y}[g(X)] is the expectation with respect to the distribution P(X|Y).
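For a discrete random variable, E[g(X)] is just a probability-weighted sum. A minimal sketch in Python, with made-up state probabilities and an illustrative scoring function g:

```python
# P(X) for a three-state variable (illustrative values only).
p_x = {"Low": 0.3, "Medium": 0.5, "High": 0.2}

# A scoring function g mapping states to numbers.
g = {"Low": 0, "Medium": 1, "High": 2}

# E[g(X)] = sum over x of g(x) * P(X = x)
e_g = sum(g[x] * p for x, p in p_x.items())
print(e_g)   # 0.3*0 + 0.5*1 + 0.2*2 = 0.9

# E_{X|Y}[g(X)] uses the conditional distribution P(X | Y) instead
# (again, hypothetical numbers).
p_x_given_y = {"Low": 0.1, "Medium": 0.4, "High": 0.5}
e_g_given_y = sum(g[x] * p for x, p in p_x_given_y.items())
print(e_g_given_y)   # 0.1*0 + 0.4*1 + 0.5*2 = 1.4
```

The subscript in E_X or E_{X|Y} corresponds to nothing more than which probability table the sum runs over.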
The notation Var(X) is used to refer to the variance of the random variable X. For a vector-valued random variable X, Var(X) is a matrix giving Var(X_k) on the diagonal and the covariance of X_i and X_j in the off-diagonal elements.
If A and B are two events or two random variables, then the notation A ⊥⊥ B or I(A|∅|B) is used to indicate that A is independent of B. The notations A ⊥⊥ B | C and I(A|C|B) indicate that A is independent of B when conditioned on the value of C (or the event C).
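Conditional independence can be checked numerically on a small joint distribution. The sketch below constructs a toy joint P(A, B, C) in which C is a common parent of A and B (all numbers are hypothetical, chosen only for illustration), then verifies A ⊥⊥ B | C while showing that A and B are marginally dependent:

```python
from itertools import product

# Hypothetical conditional probability tables for binary A, B, C.
p_c = {0: 0.4, 1: 0.6}
p_a_given_c = {0: 0.2, 1: 0.7}   # P(A=1 | C=c)
p_b_given_c = {0: 0.5, 1: 0.9}   # P(B=1 | C=c)

def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x."""
    return p if x == 1 else 1 - p

# Joint distribution: P(A,B,C) = P(C) P(A|C) P(B|C), so A ⊥⊥ B | C holds.
joint = {(a, b, c): p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)
         for a, b, c in product((0, 1), repeat=3)}

def prob(pred):
    """Probability of the event described by pred(a, b, c)."""
    return sum(v for k, v in joint.items() if pred(*k))

# Defining property: P(A=1, B=1 | C=c) = P(A=1 | C=c) P(B=1 | C=c).
for c in (0, 1):
    lhs = prob(lambda a, b, cc: a == 1 and b == 1 and cc == c) / p_c[c]
    rhs = p_a_given_c[c] * p_b_given_c[c]
    assert abs(lhs - rhs) < 1e-12

# Marginally, however, A and B are dependent (C is a common cause):
p_ab = prob(lambda a, b, c: a == 1 and b == 1)
p_a = prob(lambda a, b, c: a == 1)
p_b = prob(lambda a, b, c: b == 1)
print(abs(p_ab - p_a * p_b) > 1e-6)   # P(A, B) ≠ P(A) P(B)
```

This is the common-variable-dependence pattern discussed in Sect. 3.3.2: conditioning on C removes the dependence between A and B.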
The notation N(μ, σ²) is used to refer to a normal distribution with mean μ and variance σ²; N⁺(μ, σ²) refers to the same distribution truncated at zero (so the random variable is strictly positive). The notation Beta(a, b) is used to refer to a beta distribution with parameters a and b. The notation Dirichlet(a1, . . . , aK) is used to refer to a K-dimensional Dirichlet distribution with parameters a1, . . . , aK. The notation Gamma(a, b) is used for a gamma distribution with shape parameter a and scale parameter b.
The symbol ∼ is used to indicate that a random variable follows a particular distribution. Thus X ∼ N(0, 1) would indicate that X is a random variable following a normal distribution with mean 0 and variance 1.
Note that Γ (n) = (n − 1)! when n is a positive integer.
The notation B(a, b) is used for the beta function:

B(a, b) = Γ(a)Γ(b) / Γ(a + b).

The binomial coefficient notation, written with n over y in parentheses, is used to indicate the combinatorial function n!/(y!(n − y)!). The extended combinatorial function replaces the factorials with gamma functions, Γ(n + 1)/(Γ(y + 1)Γ(n − y + 1)), so that it is also defined for noninteger arguments.
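These identities can be checked numerically with the Python standard library; a sketch in which `beta_fn` and `ext_comb` are our own helper names (`math.gamma` and `math.comb` supply Γ and the combinatorial function):

```python
from math import gamma, comb, factorial

# Γ(n) = (n − 1)! for positive integers n.
for n in range(1, 8):
    assert gamma(n) == factorial(n - 1)

# Beta function: B(a, b) = Γ(a) Γ(b) / Γ(a + b).
def beta_fn(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

print(beta_fn(2, 3))   # 1/12 ≈ 0.0833...

# Combinatorial function: n! / (y! (n − y)!).
assert comb(5, 2) == factorial(5) // (factorial(2) * factorial(3))

# The gamma form extends the combinatorial function beyond integers.
def ext_comb(n, y):
    return gamma(n + 1) / (gamma(y + 1) * gamma(n - y + 1))

print(ext_comb(5, 2))   # 10.0, matching comb(5, 2)
```

The gamma form of the coefficient is what makes expressions such as the beta-binomial likelihood well defined when its parameters are not integers.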
Usual Use of Letters for Indices
The letter i is usually used to index individuals.
The letter j is usually used to index tasks, with J being the total number of tasks.
The letter k is usually used to index states of a variable, with K being the total number of states. The notation k[X] is an indicator which is 1 when the random variable X takes on the kth possible value, and zero otherwise.
If x = (x1, . . . , xK) is a vector, then x<k refers to the first k − 1 elements of x, (x1, . . . , xk−1), and x>k refers to the last K − k elements, (xk+1, . . . , xK). They refer to the empty set when k = 1 or k = K, respectively. The notation x−k refers to all elements except the kth; that is, (x1, . . . , xk−1, xk+1, . . . , xK).
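With a zero-based Python list standing in for the one-based vector x = (x1, . . . , xK), these conventions correspond to simple slices; a minimal sketch in which the helper names `below`, `above`, and `drop` are our own:

```python
x = ["x1", "x2", "x3", "x4", "x5"]   # K = 5; x[k-1] plays the role of x_k

def below(x, k):
    """x_{<k}: the first k − 1 elements."""
    return x[:k - 1]

def above(x, k):
    """x_{>k}: the last K − k elements."""
    return x[k:]

def drop(x, k):
    """x_{−k}: all elements except the kth."""
    return x[:k - 1] + x[k:]

print(below(x, 3))   # ['x1', 'x2']
print(above(x, 3))   # ['x4', 'x5']
print(drop(x, 3))    # ['x1', 'x2', 'x4', 'x5']
print(below(x, 1), above(x, 5))   # both empty, as the text notes
```

The off-by-one shifts in the slices are exactly the translation between the book's one-based subscripts and Python's zero-based indexing.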
Contents

Part I Building Blocks for Bayesian Networks
1 Introduction 3
1.1 An Example Bayes Network 4
1.2 Cognitively Diagnostic Assessment 7
1.3 Cognitive and Psychometric Science 11
1.4 Ten Reasons for Considering Bayesian Networks 14
1.5 What Is in This Book 16
2 An Introduction to Evidence-Centered Design 19
2.1 Overview 20
2.2 Assessment as Evidentiary Argument 21
2.3 The Process of Design 23
2.4 Basic ECD Structures 26
2.4.1 The Conceptual Assessment Framework 27
2.4.2 Four-Process Architecture for Assessment Delivery 34
2.4.3 Pretesting and Calibration 38
2.5 Conclusion 39
3 Bayesian Probability and Statistics: a Review 41
3.1 Probability: Objective and Subjective 41
3.1.1 Objective Notions of Probability 42
3.1.2 Subjective Notions of Probability 43
3.1.3 Subjective–Objective Probability 45
3.2 Conditional Probability 46
3.3 Independence and Conditional Independence 51
3.3.1 Conditional Independence 53
3.3.2 Common Variable Dependence 54
3.3.3 Competing Explanations 55
3.4 Random Variables 57
3.4.1 The Probability Mass and Density Functions 57
3.4.2 Expectation and Variance 60
3.5 Bayesian Inference 62
3.5.1 Re-expressing Bayes Theorem 63
3.5.2 Bayesian Paradigm 63
3.5.3 Conjugacy 67
3.5.4 Sources for Priors 72
3.5.5 Noninformative Priors 74
3.5.6 Evidence-Centered Design and the Bayesian Paradigm 76

4 Basic Graph Theory and Graphical Models 81
4.1 Basic Graph Theory 82
4.1.1 Simple Undirected Graphs 83
4.1.2 Directed Graphs 83
4.1.3 Paths and Cycles 84
4.2 Factorization of the Joint Distribution 86
4.2.1 Directed Graph Representation 86
4.2.2 Factorization Hypergraphs 88
4.2.3 Undirected Graphical Representation 90
4.3 Separation and Conditional Independence 91
4.3.1 Separation and D-Separation 91
4.3.2 Reading Dependence and Independence from Graphs 93
4.3.3 Gibbs–Markov Equivalence Theorem 94
4.4 Edge Directions and Causality 95
4.5 Other Representations 97
4.5.1 Influence Diagrams 97
4.5.2 Structural Equation Models 99
4.5.3 Other Graphical Models 100
5 Efficient Calculations 105
5.1 Belief Updating with Two Variables 106
5.2 More Efficient Procedures for Chains and Trees 111
5.2.1 Propagation in Chains 112
5.2.2 Propagation in Trees 116
5.2.3 Virtual Evidence 119
5.3 Belief Updating in Multiply Connected Graphs 122
5.3.1 Updating in the Presence of Loops 122
5.3.2 Constructing a Junction Tree 123
5.3.3 Propagating Evidence Through a Junction Tree 134
5.4 Application to Assessment 135
5.4.1 Proficiency and Evidence Model Bayes Net Fragments 137
5.4.2 Junction Trees for Fragments 139
5.4.3 Calculation with Fragments 143
5.5 The Structure of a Test 145
5.5.1 The Q-Matrix for Assessments Using Only Discrete Items 146
5.5.2 The Q-Matrix for a Test Using Multi-observable Tasks 147
5.6 Alternative Computing Algorithms 149
5.6.1 Variants of the Propagation Algorithm 150
5.6.2 Dealing with Unfavorable Topologies 150
6 Some Example Networks 157
6.1 A Discrete IRT Model 158
6.1.1 General Features of the IRT Bayes Net 161
6.1.2 Inferences in the IRT Bayes Net 162
6.2 The “Context” Effect 166
6.3 Compensatory, Conjunctive, and Disjunctive Models 172
6.4 A Binary-Skills Measurement Model 178
6.4.1 The Domain of Mixed Number Subtraction 178
6.4.2 A Bayes Net Model for Mixed-Number Subtraction 180
6.4.3 Inferences from the Mixed-Number Subtraction Bayes Net 184
6.5 Discussion 190
7 Explanation and Test Construction 197
7.1 Simple Explanation Techniques 198
7.1.1 Node Coloring 198
7.1.2 Most Likely Scenario 200
7.2 Weight of Evidence 201
7.2.1 Evidence Balance Sheet 202
7.2.2 Evidence Flow Through the Graph 205
7.3 Activity Selection 209
7.3.1 Value of Information 209
7.3.2 Expected Weight of Evidence 213
7.3.3 Mutual Information 215
7.4 Test Construction 215
7.4.1 Computer Adaptive Testing 216
7.4.2 Critiquing 217
7.4.3 Fixed-Form Tests 220
7.5 Reliability and Assessment Information 224
7.5.1 Accuracy Matrix 225
7.5.2 Consistency Matrix 230
7.5.3 Expected Value Matrix 230
7.5.4 Weight of Evidence as Information 232
Part II Learning and Revising Models from Data

8 Parameters for Bayesian Network Models 241
8.1 Parameterizing a Graphical Model 241
8.2 Hyper-Markov Laws 244
8.3 The Conditional Multinomial—Hyper-Dirichlet Family 246
8.3.1 Beta-Binomial Family 247
8.3.2 Dirichlet-Multinomial Family 248
8.3.3 The Hyper-Dirichlet Law 248
8.4 Noisy-OR and Noisy-AND Models 250
8.4.1 Separable Influence 254
8.5 DiBello’s Effective Theta Distributions 254
8.5.1 Mapping Parent Skills to θ Space 256
8.5.2 Combining Input Skills 257
8.5.3 Samejima’s Graded Response Model 260
8.5.4 Normal Link Function 263
8.6 Eliciting Parameters and Laws 267
8.6.1 Eliciting Conditional Multinomial and Noisy-AND 269
8.6.2 Priors for DiBello’s Effective Theta Distributions 272
8.6.3 Linguistic Priors 273
9 Learning in Models with Fixed Structure 279
9.1 Data, Models, and Plate Notation 279
9.1.1 Plate Notation 280
9.1.2 A Bayesian Framework for a Generic Measurement Model 282
9.1.3 Extension to Covariates 284
9.2 Techniques for Learning with Fixed Structure 287
9.2.1 Bayesian Inference for the General Measurement Model 288
9.2.2 Complete Data Tables 289
9.3 Latent Variables as Missing Data 297
9.4 The EM Algorithm 298
9.5 Markov Chain Monte Carlo Estimation 305
9.5.1 Gibbs Sampling 308
9.5.2 Properties of MCMC Estimation 309
9.5.3 The Metropolis–Hastings Algorithm 312
9.6 MCMC Estimation in Bayes Nets in Assessment 315
9.6.1 Initial Calibration 316
9.6.2 Online Calibration 321
9.7 Caution: MCMC and EM are Dangerous! 324
10 Critiquing and Learning Model Structure 331
10.1 Fit Indices Based on Prediction Accuracy 332
10.2 Posterior Predictive Checks 335
10.3 Graphical Methods 342
10.4 Differential Task Functioning 347
10.5 Model Comparison 350
10.5.1 The DIC Criterion 350
10.5.2 Prediction Criteria 353
10.6 Model Selection 354
10.6.1 Simple Search Strategies 355
10.6.2 Stochastic Search 356
10.6.3 Multiple Models 357
10.6.4 Priors Over Models 357
10.7 Equivalent Models and Causality 358
10.7.1 Edge Orientation 358
10.7.2 Unobserved Variables 358
10.7.3 Why Unsupervised Learning cannot Prove Causality 360
10.8 The “True” Model 362
11 An Illustrative Example 371
11.1 Representing the Cognitive Model 372
11.1.1 Representing the Cognitive Model as a Bayesian Network 372
11.1.2 Representing the Cognitive Model as a Bayesian Network 377
11.1.3 Higher-Level Structure of the Proficiency Model; i.e., p(θ|λ) and p(λ) 379
11.1.4 High-Level Structure of the Evidence Models; i.e., p(π) 381
11.1.5 Putting the Pieces Together 382
11.2 Calibrating the Model with Field Data 382
11.2.1 MCMC Estimation 383
11.2.2 Scoring 389
11.2.3 Online Calibration 392
11.3 Model Checking 397
11.3.1 Observable Characteristic Plots 398
11.3.2 Posterior Predictive Checks 401
11.4 Closing Comments 405
Part III Evidence-Centered Assessment Design

12 The Conceptual Assessment Framework 411
12.1 Phases of the Design Process and Evidentiary Arguments 414
12.1.1 Domain Analysis and Domain Modeling 414
12.1.2 Arguments and Claims 418
12.2 The Student Proficiency Model 424
12.2.1 Proficiency Variables 424
12.2.2 Relationships Among Proficiency Variables 428
12.2.3 Reporting Rules 433
12.3 Task Models 438
12.4 Evidence Models 443
12.4.1 Rules of Evidence (for Evidence Identification) 444
12.4.2 Statistical Models of Evidence (for Evidence Accumulation) 448
12.5 The Assembly Model 453
12.6 The Presentation Model 458
12.7 The Delivery Model 460
12.8 Putting It All Together 461
13 The Evidence Accumulation Process 467
13.1 The Four-Process Architecture 468
13.1.1 A Simple Example of the Four-Process Framework 471
13.2 Producing an Assessment 474
13.2.1 Tasks and Task Model Variables 474
13.2.2 Evidence Rules 478
13.2.3 Evidence Models, Links, and Calibration 486
13.3 Scoring 488
13.3.1 Basic Scoring Protocols 489
13.3.2 Adaptive Testing 493
13.3.3 Technical Considerations 497
13.3.4 Score Reports 500
14 Biomass: An Assessment of Science Standards 507
14.1 Design Goals 507
14.2 Designing Biomass 510
14.2.1 Reconceiving Standards 510
14.2.2 Defining Claims 513
14.2.3 Defining Evidence 514
14.3 The Biomass Conceptual Assessment Framework 515
14.3.1 The Proficiency Model 515
14.3.2 The Assembly Model 519
14.3.3 Task Models 523
14.3.4 Evidence Models 529
14.4 The Assessment Delivery Processes 535
14.4.1 Biomass Architecture 536
14.4.2 The Presentation Process 538
14.4.3 Evidence Identification 540
14.4.4 Evidence Accumulation 541
14.4.5 Activity Selection 543
14.4.6 The Task/Evidence Composite Library 543
14.4.7 Controlling the Flow of Information Among the Processes 544
14.5 Conclusion 545
15 The Biomass Measurement Model 549
15.1 Specifying Prior Distributions 550
15.1.1 Specification of Proficiency Variable Priors 552
15.1.2 Specification of Evidence Model Priors 554
15.1.3 Summary Statistics 560
15.2 Pilot Testing 561
15.2.1 A Convenience Sample 561
15.2.2 Item and other Exploratory Analyses 564
15.3 Updating Based on Pilot Test Data 566
15.3.1 Posterior Distributions 566
15.3.2 Some Observations on Model Fit 575
15.3.3 A Quick Validity Check 577
15.4 Conclusion 579
16 The Future of Bayesian Networks in Educational Assessment 583
16.1 Applications of Bayesian Networks 583
16.2 Extensions to the Basic Bayesian Network Model 586
16.2.1 Object-Oriented Bayes Nets 586
16.2.2 Dynamic Bayesian Networks 588
16.2.3 Assessment-Design Support 592
16.3 Connections with Instruction 593
16.3.1 Ubiquitous Assessment 594
16.4 Evidence-Centered Assessment Design and Validity 596
16.5 What We Still Do Not Know 597
A Bayesian Network Resources 601
A.1 Software 601
A.1.1 Bayesian Network Manipulation 602
A.1.2 Manual Construction of Bayesian Networks 603
A.1.3 Markov Chain Monte Carlo 603
A.2 Sample Bayesian Networks 604
References 607
Author Index 639
List of Figures
1.1 A graph for the Language Testing Example 5
2.1 The principal design objects of the conceptual assessment framework (CAF) 27
2.2 The proficiency model for a single variable, Proficiency Level 28
2.3 The measurement model for a dichotomously scored item 31
2.4 The four principal processes in the assessment cycle 35
3.1 Canonical experiment: balls in an urn 42
3.2 Graph for Feller’s accident proneness example 55
3.3 Unidimensional IRT as a graphical model 56
3.4 Variables θ1 and θ2 are conditionally dependent given X 56
3.5 Examples of discrete and continuous distributions. a Discrete distribution. b Continuous distribution 59
3.6 Likelihood for θ generated by observing 7 successes in 10 trials 65
3.7 A panel of sample beta distributions 68
3.8 A panel of sample gamma distributions 73
4.1 A simple undirected graph 83
4.2 A directed graph 84
4.3 A tree contains no cycles 85
4.4 Examples of cyclic and acyclic directed graphs 85
4.5 Filling-in edges for triangulation. Without the dotted edge, this graph is not triangulated; adding the dotted edge makes the graph triangulated 86
4.6 Directed graph for P(A)P(B)P(C|A, B)P(D|C)P(E|C)P(F|E, D) 87
4.7 Example of a hypergraph (a) and its 2-section (b) 89
4.8 Hypergraph representing P(A)P(B)P(C|A, B)P(D|C)P(E|C)P(F|E, D) 89
4.9 2-section of Fig. 4.8 90
4.10 D-Separation 93
4.11 Directed graph running in the “causal” direction: P(Skill)
5.2 The acyclic digraph and junction tree for a four-variable chain 112
5.3 A junction tree corresponding to a singly-connected graph 119
5.4 A polytree and its corresponding junction tree 119
5.5 A loop in a multiply connected graph 124
5.6 The tree of cliques and junction tree for Figure 5.5 124
5.7 Acyclic digraph for two-skill example (Example 5.5) 125
5.8 Moralized undirected graph for two-skill example (Example 5.5) 129
5.9 Two ways to triangulate a graph with a loop 130
5.10 Cliques for the two-skill example 130
5.11 Junction tree for the two-skill example 132
5.12 Relationship among proficiency model and evidence model Bayes net fragments 138
5.13 Total acyclic digraph for three-task test 141
5.14 Proficiency model fragments for three-task test 141
5.15 Evidence model fragments for three-task test 141
5.16 Moralized proficiency model graph for three-task test 142
5.17 Moralized evidence model fragments for three-task test 142
6.1 Model graph for five item IRT model 160
6.2 The initial probabilities for the IRT model in Netica. The numbers at the bottom of the box for the Theta node represent the expected value and standard deviation of Theta 162
6.3 a Student with Item 2 and 3 correct. b Student with Item 3 and 4 correct 164
6.4 Probabilities conditioned on θ = 1 165
6.5 Five item IRT model with local dependence 168
6.6 a Student with Item 2 and Item 3 correct with context effect. b Student with Item 3 and Item 4 correct with context effect 169
6.7 Three different ways of modeling observable with two parents 173
6.8 Initial probabilities for three distribution types 174
6.9 a Updated probabilities when Observation = Right. b Updated probabilities when Observation = Wrong 175
6.10 a Updated probabilities when P1 = H and Observation = Right. b Updated probabilities when P1 = H and Observation = Wrong 176
6.11 a Updated probabilities when P1 = M and Observation = Right. b Updated probabilities when P1 = M and Observation = Wrong 177
6.12 Proficiency model for Method B for solving mixed number subtraction 181
6.13 Two evidence model fragments for evidence models 3 and 4 182
6.14 Full Bayesian model for Method B for solving mixed number subtraction 183
6.15 Prior (population) probabilities 185
6.16 Mixed number subtraction: a sample student 186
6.17 Mixed number subtraction: posterior after 7 items 188
6.18 Mixed number subtraction: Skill 1 = Yes 189
6.19 Mixed number subtraction: Skill 1 = No 191
6.20 Mixed number subtraction: Skill 1 = Yes, Skill 2 = No 192
6.21 Updated probabilities when P1 = L and Observation = Wrong 194
7.1 Colored graph for an examinee who reads well but has trouble speaking 199
7.2 Evidence balance sheet for θ ≥ 1 (Example 7.2) 203
7.3 Evidence balance sheet for Reading = Novice (Example 7.1) 204
7.4 Evidence balance sheet for Listening = Novice (Example 7.1) 206
7.5 Evidence balance sheet for Speaking = Novice (Example 7.1) 207
7.6 Evidence flows using weight of evidence 208
7.7 Proficiency model for ACED assessment 219
7.8 Subset of ACED proficiency model for exercises 234
8.1 A simple latent class model with two skills 242
8.2 Hypergraph for a simple latent class model with two skills 242
8.3 Second layer of parameters on top of graphical model 243
8.4 Introducing the demographic variable Grade breaks global parameter independence 245
8.5 A conjunctive model, with no “noise” 250
8.6 A conjunctive model with noisy output (DINA) 251
8.7 A conjunctive model with noisy inputs (NIDA) 252
8.8 A noisy conjunctive model, with noisy inputs and outputs 253
8.9 Separable influences 255
8.10 Midpoints of intervals on the normal distribution 256
8.11 Probabilities for Graded Response model 262
8.12 Output translation method 264
9.1 Expanded and plate digraphs for four Bernoulli variables 281
9.2 Plate digraph for hierarchical Beta-Bernoulli model 282
9.3 Plate digraph for hierarchical Rasch model 283
9.4 Graph for generic measurement model 286
9.5 Graph with covariates and no DIF 287
9.6 Graph with DIF 288
9.7 Graph for mixture model 288
9.8 Graph for three-item latent class model 292
9.9 Examples of poor mixing and good mixing 311
9.10 Convergence of three Markov chains 311
9.11 Plot of the Brooks–Gelman–Rubin R vs cycle number 312
9.12 A Metropolis step that will always be accepted 314
9.13 A Metropolis step that might be rejected 314
9.14 Posteriors for selected parameters 321
10.1 Two alternative discrete IRT models with and without context effect. a Conditionally independent IRT model. b Testlet model 339
10.2 Stem-and-leaf plots of posterior predictive probabilities for Q3 values of item pairs 341
10.3 Q3 values for item pairs 341
10.4 Observable characteristic plot 344
10.5 Observable characteristic plot for additional skill 345
10.6 Direct data display for mixed number subtraction test 347
10.7 Two graphs showing the effects of differential task functioning. a Graph with no DTF. b Graph with DTF 348
10.8 Graphs (a), (b), and (c) have identical independence structures, but Graph (d) does not 359
10.9 Four graphs which could have identical distributions on the observed variables A and C. a No hidden cause, b common cause, c intermediate cause, d partial cause 359
10.10 Selection effect produces apparent dependence among observed variables. a No selection effect. b Selection effect 360
10.11 A minimal graph which should not be interpreted as causal 361
10.12 Inclusion of an additional variable changes picture dramatically 362
10.13 Two candidate models for a hypothetical medical licensure assessment. a Model A. b Model B 363
10.14 Observable characteristic plot for Exercise 10.13 368
10.15 Observable characteristic plot for Exercise 10.14 368
11.1 Proficiency model for mixed-number subtraction, method B 373
11.2 Evidence model fragments for evidence models 3 and 4 375
11.3 Mixed number subtraction Bayes net 375
11.4 Plate representation of the parameterized mixed-number subtraction model 382
11.5 Gelman–Rubin potential scale reduction factors for selected parameters 386
11.6 Histories of MCMC chains for selected parameters 387
11.7 Posterior distributions from MCMC chains for selected parameters 389
11.8 Observable characteristic plots for first eight items 399
11.9 Observable characteristic plots for last seven items 400
12.1 The principal design objects of the CAF 413
12.2 Toulmin’s structure for arguments 420
12.3 Partial language testing proficiency model showing a part-of relationship 429
12.4 Proficiency model for language testing using only the four modal skills 430
12.5 Proficiency model for language with additional communicative competence skills 431
12.6 Prototype score report for Language Placement Test 436
12.7 Evidence model for lecture clarification task for use with modal proficiency model 450
13.1 The four-process architecture for assessment delivery 470
13.2 Assessment blueprint for a small test 474
13.3 Tasks generated from CAF in Fig. 13.2 475
13.4 A sample receiver operating characteristic (ROC) curve 480
13.5 Link model generated from Task 1a in Fig. 13.3 487
13.6 Absorbing evidence from Task 1a for a single candidate 490
13.7 ACED proficiency model 495
14.1 A concept map for mechanisms of evolution 511
14.2 Representation of a standards-based domain for assessment design 512
14.3 The Biomass proficiency model 516
14.4 Biomass: a sample score report from the Biomass classroom assessment 520
14.5 Biomass: the introductory screen 523
14.6 Biomass: background for first task 526
14.7 Biomass: first task is to complete a table for allele representation of mode of inheritance 527
14.8 Biomass: second task, population attribute table 528
14.9 Biomass: fourth task, what to do next? 529
14.10 An evidence model Bayes net fragment with seven observables 533
14.11 An evidence model using three observables and a context effect 534
14.12 An evidence model fragment with three conditionally-independent observables 534
14.13 An evidence model fragment using the inhibitor relationship 535
16.1 A basic dynamic Bayesian network 589
16.2 A learning-model dynamic Bayesian network 590
16.3 Instruction as a partially observed Markov decision process 591
16.4 Influence diagram for skill training decision 593
List of Tables
5.1 Updating probabilities in response to learning Y = y1 109
5.2 Numerical example of updating probabilities 110
5.3 Updating probabilities down a chain 117
5.4 Updating probabilities up a chain 118
5.5 Updating with virtual evidence 121
5.6 Probabilities for Example 5.5 127
5.7 Potential tables for the two-skill example 132
5.8 Updating the potential tables for {θA, θB, X2} 136
5.9 Q-Matrix for design leading to saturated model 147
5.10 Q-Matrix for Fig 5.13, one row per task 148
5.11 Q-Matrix for Fig 5.13, one row per observable 148
5.12 Q-Matrix for proposed assessment (Exercise 5.12) 154
6.1 Conditional probabilities of a correct response for the five-item IRT model 160
6.2 Initial marginal probabilities for five items from IRT model 162
6.3 New potentials for Item 3 and Item 4, conditioned on Context 170
6.4 Conditional probabilities for the three distributions 174
6.5 Q-Matrix for the Tatsuoka (1984) mixed number subtraction test 179
6.6 Conditional probability table for Skill 5 184
6.7 Conditional probability table (CPT) for Item 16 184
6.8 Posteriors after two sets of observations 187
6.9 Predictions for various skill patterns, subtraction assessment 190
6.10 Potentials for Exercise 6.17 195
7.1 Accuracy matrices for Reading and Writing based on 1000 simulated students 227
7.2 Expected accuracy matrices based on 1000 simulations 232
7.3 Conditional probabilities for ACED subset proficiency model 234
7.4 Conditional probabilities for evidence models for ACED subset 234
7.5 Alternative conditional probabilities for ACED subset proficiency model 235
7.6 Accuracy matrices for Speaking and Listening based on 1000 simulated students 237
8.1 Effective thetas for a compensatory combination function 258
8.2 Effective thetas for the conjunctive and disjunctive combination functions 259
8.3 Effective thetas for inhibitor combination functions 260
8.4 Conditional probability table for simple graded response model 263
8.5 Compensatory combination function and graded response link function 263
8.6 Conditional probability table with normal link function, correlation = 0.8 266
8.7 Conditional probability table for path coefficients 0.58 and 0.47 267
9.1 Special cases of the generic measurement model 285
9.2 Prior and posterior statistics for beta distribution with r successes in n trials 291
9.3 Response pattern counts with proficiency variable, θ 295
9.4 Response pattern counts collapsing over proficiency variable, θ 295
9.5 E-step probabilities for iteration 1, by response pattern 304
9.6 E-step expected response pattern counts 304
9.7 E-step iteration 1 expectations of sufficient statistics 305
9.8 Trace of EM parameter estimates 306
9.9 MCMC cycle response pattern counts 318
9.10 MCMC parameter draws from intervals of 1000 and summary statistics 320
9.11 Approximating Dirichlet priors from posterior means and standard deviations for π31 322
9.12 Response pattern counts for online calibration 323
9.13 Average parameter values from initial and online calibrations 324
10.1 Actual and predicted outcomes for the hypothetical medical licensure exam 364
10.2 Logarithmic scores for ten student outcome vectors 365
10.3 Deviance values for two ACED models 366
10.4 Observed outcome for two items for Exercise 10.10 367
11.1 Skill requirements for fraction subtraction items 374
11.2 Equivalence classes and evidence models 376
11.3 Summary statistics for binary-skills model 390
11.4 Selected student responses 391
11.5 Prior and posterior probabilities for selected examinees 392
11.6 Summary statistics for binary-skills model, Admin 1 394
11.7 Summary statistics for binary-skills model, Admin 2 395
11.8 Item-fit indices for the mixed-number subtraction test 404
11.9 Person-fit p-values for selected students 404
13.1 Summary of the four processes 469
13.2 Confusion matrix for binary proficiency and observable 479
13.3 Expected accuracy matrix for observable PC3 in Task Exp4.1 483
13.4 MAP accuracy matrix for Task Exp4.1 483
13.5 MAP accuracy matrix for Task Exp6.1 483
13.6 Three-way table of two observables given proficiency variable 484
13.7 Three-way table of two observables given marginal proficiency 485
13.8 Calculation of expected weight of evidence 496
13.9 Data for Exercise 13.4 502
13.10 Ten randomly selected entries from a set of pretest data 502
13.11 Expected accuracy matrix for two experimental tasks 503
13.12 Expected accuracy matrix (normalized) for two multiple-choice tasks 503
13.13 Data for differential task functioning detection problem (Exercise 13.8) 503
13.14 Data for conditional independence test problem (Exercise 13.9) 504
13.15 Calculation of expected weight of evidence after one observation 504
14.1 A hierarchical textual representation of science standards 509
14.2 Potential observations related to scientific investigation 514
14.3 Segments in the Biomass “mice” scenario 522
14.4 Connecting knowledge representations with investigation steps 525
14.5 Rules of evidence for table task 532
15.1 Task and evidence models from the first Biomass segment 551
15.2 Initial conditional distributions for observables 2–7 of Task 1 556
15.3 Initial conditional distributions for observable 1 of Task 2 558
15.4 Initial conditional probability distributions for all three observables of Task 3 559
15.5 Initial conditional distribution for observable 1 of Task 4 560
15.6 Summary statistics of parameter prior distributions 562
15.7 Observed responses 564
15.8 Summary statistics of prior and posterior population parameter distributions 567
15.9 Summary statistics of item parameter distributions 568
15.10 Prior and posterior expected proficiency levels 569
15.11 Revised conditional distributions for observable 3 of Task 1 571
15.12 Revised conditional probability table for observable 4 of Task 1 572
15.13 Revised conditional distributions for observable 1 of Task 2 573
15.14 A set of simulated preposterior predictive responses 576
Part I
Building Blocks for Bayesian Networks
1 Introduction
David Schum’s 1994 book, The Evidential Foundations of Probabilistic Reasoning, changed the way we thought about assessment. Schum, a psychologist cum legal scholar, was writing about evidence in the most familiar meaning of the word, looking at a lawyer’s use of evidence to prove or disprove a proposition to a jury. However, Schum placed that legal definition in the context of the many broader uses of the term “evidence” in other disciplines. Schum notes that scientists and historians, doctors and engineers, auto mechanics, and intelligence analysts all use evidence in their particular fields. From their cross-disciplinary perspectives, philosophers, statisticians, and psychologists have come to recognize basic principles of reasoning from imperfect evidence that cut across these fields.
Mislevy (1994) shows how to apply the idea of evidence to assessment in education. Say, for example, we wish to show that a student, having completed a reading course, is capable of reading, with comprehension, an article from The New York Times. We cannot open the student’s brain and observe directly the level of comprehension, but we can ask the student questions about various aspects of articles she reads. The answers provide evidence of whether or not the student comprehended what she read, and therefore has the claimed skill.
Schum (1994) faced two problems when developing evidential arguments in practical settings: uncertainty and complexity. We face those same problems in educational assessment and have come to adopt the same solutions: probability theory and Bayesian networks.
Schum (1994) surveys a number of techniques for representing imprecise and uncertain states of knowledge. While approaches such as fuzzy sets, belief functions, and inductive probability all offer virtues and insights, Schum gravitates to probability theory as a best answer. Certainly, probability has had the longest history of practical application and hence is the best understood. Although other systems for representing uncertain states of knowledge, such as Dempster–Shafer models (Shafer 1976; Almond 1995), may provide a broader
In complex situations, it can be difficult to calculate the probability of an event, especially if there are many dependencies. The solution is to draw a picture. A graphical model, a graph whose nodes represent the variables and whose edges represent dependencies between them, provides a guide for both constructing and computing with the statistical models. Graphical models in which all the variables are discrete have some particular computational advantages. These are also known as Bayesian networks because of their capacity to represent complex and changing states of information in a Bayesian fashion. Pearl (1988) popularized this approach to represent uncertainty, especially in the artificial intelligence community. Since then, it has seen explosive growth.
This book explores the implications of applying graphical models to educational assessment. This is a powerful technique that supports the use of more complex models in testing, but is also compatible with the models and techniques that have been developing in psychometrics over the last century. There is an immediate benefit of enabling us to build models which are closer to the cognitive theory of the domain we are testing (Pellegrino et al. 2001). Furthermore, this approach can support the kind of complexity necessary for diagnostic testing and complex constructed response or interactive tasks.

This book divides the story of Bayesian networks in educational assessment into three parts. Part I describes the basic properties of a Bayesian network and how they can be used to accumulate evidence about the state of proficiency of a student. Part II describes how Bayesian networks can be constructed, and in particular, how both the parameters and structure can be refined with data. Part III ties the mathematics of the network to the evidence-centered assessment design (ECD) framework for developing assessments and contains an extensive and detailed example. The present chapter briefly explores the question of why Bayesian networks provide an interesting choice of measurement model for educational assessments.
1.1 An Example Bayes Network
Bayesian networks are formally defined in Chap. 4, but a simple example will help illustrate the basic concepts.
Example 1.1 (Language Testing Example) (Mislevy 1995c). Imagine a language assessment which is designed to report on four proficiency variables: Reading, Writing, Speaking, and Listening. This assessment has four types of task: (1) a reading task, (2) a task which requires both writing and reading, (3) a task which requires speaking and either reading or listening, and (4) a listening task. Evaluating the work product (selection, essay, or speech) produces a single observable outcome variable for each task. These are named Outcome R, Outcome RW, Outcome RSL, and Outcome L, respectively.
[Figure: proficiency nodes Reading, Writing, Speaking, and Listening, with edges to the observable nodes Outcome R, Outcome RW, Outcome RSL, and Outcome L.]
Fig. 1.1 A graph for the Language Testing Example. A Bayesian network for the language test example of Mislevy (1995c). Rounded rectangles in the picture represent variables in the model. Arrows (“edges”) represent patterns of dependence and independence among the variables. This graph provides a visual representation of the joint probability distribution over the variables in the picture. Reprinted with permission from Sage Publications
Figure 1.1 shows the graph associated with this example. Following conventions from ECD (cf. Chaps. 2 and 12), the nodes (rounded rectangles in the graph) for the proficiency variables are ornamented with a circle, and the nodes for the evidence variables are ornamented with a triangle. The edges in the graph flow from the proficiency variables to the observable variables for tasks which require those proficiencies. Thus the graph gives us the information about which skills are relevant for which tasks, providing roughly the same information that a Q-Matrix does in many cognitively diagnostic assessment models (Tatsuoka 1983).
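The skills-to-tasks mapping that the graph encodes can be written down directly in code. The sketch below is ours, not the book’s: the dictionary layout and variable names are illustrative, and it simply lists each observable’s parent proficiencies and derives the corresponding Q-Matrix-style incidence table.

```python
# Parent proficiencies for each observable outcome variable in the
# language-test graph of Fig. 1.1 (node names follow the figure).
parents = {
    "Outcome R":   ["Reading"],
    "Outcome RW":  ["Reading", "Writing"],
    "Outcome RSL": ["Reading", "Speaking", "Listening"],
    "Outcome L":   ["Listening"],
}

skills = ["Reading", "Writing", "Speaking", "Listening"]

# Q-Matrix view: one row per observable, one column per skill;
# a 1 marks a skill that is a parent of that observable.
q_matrix = {
    obs: [1 if s in ps else 0 for s in skills]
    for obs, ps in parents.items()
}

for obs, row in q_matrix.items():
    print(f"{obs:12s} {row}")
```

Reading off a row recovers the skill requirements of a task; reading off a column shows every task that bears evidence about a skill.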
The graphs used to visualize Bayesian networks, such as Fig. 1.1, act as a mechanism for visualizing the joint probability distribution over all of the variables in a complex model, in terms of theoretical and empirical relationships among variables. This graphical representation provides a shared working space between subject matter experts, who provide insight into the cognitive processes underlying the assessment, and psychometricians (measurement experts), who are building the mathematical model. In that sense, Bayesian
As the name implies, Bayesian networks are based on Bayesian views of statistics (see Chap. 3 for a review). The key idea is that a probability distribution holds a state of knowledge about an unknown event. As Bayesian networks represent a probability distribution over multiple variables, they represent a state of knowledge about those variables.
Usually, the initial state of a Bayesian network in educational assessment is based on the distribution of proficiency in the target population for an assessment and the relationship between those proficiencies and task outcomes in the population. The probability values could have come from theory, expert opinion, experiential data, or any mixture of the three. Thus, the initial state of the Bayes net in Fig. 1.1 represents what we know about a student who enters the testing center and sits down at a testing station to take the hypothetical language test: the distribution of proficiencies in the students we typically see, and the range of performance we typically see from these students.
As the student performs the assessment tasks, evaluating the work products using the appropriate evidence rules yields values for the observable outcome variables. The values of the appropriate variables in the network are then instantiated, or set to these values, and the probability distributions in the network are updated (by recursive applications of Bayes rule). The updated network now represents our state of knowledge about this student given the evidence we have observed so far. This is a powerful paradigm for the process of assessment, and leads directly to mechanisms for explaining complex assessments and adaptively selecting future observations (Chap. 7). Chapter 13 describes how this can form the basis for an embedded scoring engine in an intelligent tutoring or assessment system.
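For a single proficiency variable and a single observed outcome, the update just described reduces to one application of Bayes rule. The numbers below are invented for illustration (they are ours, not the book’s), but the mechanics are the same ones a Bayes net engine applies recursively across the whole graph.

```python
# Prior over a binary proficiency, e.g. from the target population.
prior = {"High": 0.4, "Low": 0.6}

# P(outcome = Right | proficiency) -- illustrative values only.
p_right = {"High": 0.9, "Low": 0.3}

def update(prior, likelihood):
    """Posterior over proficiency states after one observation (Bayes rule)."""
    unnorm = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnorm.values())        # P(observation)
    return {s: v / total for s, v in unnorm.items()}

posterior = update(prior, p_right)
print(posterior)   # P(High) rises from 0.4 to 2/3 after a correct response
```

Observing a wrong answer would instead be absorbed by calling `update` with the complementary likelihoods, shifting belief toward the Low state.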
The fact that the Bayes net represents a complete Bayesian probability model has another important consequence: such models can be critiqued and refined from data. Complete Bayesian models provide a predictive probability for any observable pattern of data. Given the data pattern, the parameters of the model can be adjusted to improve the fit of the model. Similarly, alternative model structures can be proposed and explored to see if they do a better job of predicting the observed data. Chapters 9–11 explore the problems of calibrating a model to data and learning model structure from data.
Bayesian network models for assessments are especially powerful when used in the context of ECD (Mislevy et al. 2003b). Chapter 2 gives a brief introduction to some of the language used with ECD, while Chap. 12 explains it in more detail. The authors have used ECD to design a number of assessments, and our experience has caused us to come to value Bayesian networks for two reasons. First, they are multivariate models appropriate for cognitively diagnostic assessment (Sect. 1.2). Second, they help assessment designers to explicitly draw the connection between measurement models and the cognitive models that underlie them (Sect. 1.3).
1.2 Cognitively Diagnostic Assessment
Most psychometricians practicing today work with high-stakes tests designed for selection, placement, or licensing decisions. This is no accident. Errors and inefficiencies in such tests can have high costs, both social and monetary, so it is worthwhile to employ someone to ensure that the reliability and validity of the resulting scores are high. However, because of the prominence of selection/placement tests, assumptions based on the selection/placement purpose and the high stakes are often embedded in the justification for particular psychometric models. It is worth examining closely the assumptions which come from this purpose, to tease apart the purposes, the statistics, and the psychology that are commingled in familiar testing practices.
First, it is almost always good for a selection/placement assessment to be unidimensional. The purpose of a college admission officer looking at an assessment is to rank order the candidates so as to be better able to make decisions about who to admit. This rank ordering implies that the admissions officer wants the candidates in a single line. The situation with licensure and certification testing is similar; the concern is whether or not the candidate makes the cut, and little else.
Because of the high stakes, we are concerned with maximizing the validity of the assessment—the degree to which it provides evidence for the claims we would like to make about the candidate. For selection and placement situations, a practically important indicator of validity is the degree to which the test correlates with a measure of success after the selection or placement. Test constructors can increase this correlation by increasing the reliability of the assessment—the precision of the measurement, or, roughly, the degree to which the test is correlated with itself. This can lead them to discard items which are not highly correlated with the main dimension of the test, even if they are of interest for some other reason.
Although high-stakes tests are not necessarily multiple choice, multiple choice items often play a large role in them. This is because multiple choice is particularly cost effective. The rules of evidence—procedures for determining the observable outcome variables—for multiple choice items are particularly easy to describe and efficient to implement. With thoughtful item writing, multiple choice items can test quite advanced skills. Most importantly, they take little of the student’s time to answer. A student can solve 20–30 multiple choice items in the time it would take to answer a complex constructed response task like an essay, thus increasing reliability. While a complex constructed response task may have a lower reliability than 20–30 multiple choice items, it may tap skills (e.g., generative use of language) that are difficult to measure in any other way. Hence the complex constructed response item can increase validity even though it decreases reliability.
However, the biggest constraints on high stakes testing come from security concerns. With high stakes comes incentive to cheat, and the measures to circumvent cheating are costly. These range from proctoring and verifying the identity of all candidates, to creating alternative forms of the test. The last of these produces a voracious appetite for new items as old ones are retired. It also necessitates the process of equating between scores on alternative forms of the test.
Increasingly, the end users of tests want more than just a single score to use for selection or placement. They are looking for a set of scores to help diagnose problems the examinee might be facing. This is an emerging field called cognitively diagnostic assessment (Leighton and Gierl 2007; Rupp et al. 2010). The “cognitive” part of this name indicates that scores are chosen to reflect a cognitive model of how students acquire skills (see Sect. 1.3). The “diagnostic” part of the name reflects a process that seeks to identify and provide a remedy for some problem in a student’s state of proficiency. Such diagnostic scores can be used for a variety of purposes: as an adjunct to a high stakes test to help a candidate prepare, as a guidance tool to help a learner choose an appropriate instructional strategy, or even shaping instruction on the fly in an intelligent tutoring system. Often these purposes carry much lower stakes, and hence less stringent requirements for security.
Nowhere is the interplay between high stakes and diagnostic assessment more apparent than in the No Child Left Behind (NCLB) Act passed by the U.S. Congress in 2002 and the Race to the Top program passed as part of the American Recovery and Reinvestment Act of 2009. The dual purpose of assessments—accountability and diagnosis at some level—remains a part of the U.S. educational landscape. Under these programs, all children are tested to ensure that they are meeting the state standards. Schools must be making adequate progress toward bringing all students up to the standards. This, in turn, means that educators are very interested in why students are not yet meeting the standards and what they can do to close the gap. They need diagnostic assessment to supplement the required accountability tests to help them identify problems and choose remedies.
When we switch the purpose from selection to diagnosis, everything changes. First and foremost, a multidimensional concept of proficiency usually underlies cognitively diagnostic scoring. (A metaphor: Whereas for a selection exam we might have been content with knowing the volume of the examinee, in a diagnostic assessment we want to distinguish examinees who are tall but narrow and shallow from those who are short, wide, and shallow and those who are short, narrow, and deep.) As a consequence, the single score becomes a multidimensional profile of student proficiency. Lou DiBello (personal communication) referred to such tests as profile score assessments.
The most important and difficult part of building a multidimensional model of proficiency is identifying the right variables. The variables (or suitable summaries) must be able to produce scores that the end users of the assessment care about: scores which relate to claims we wish to make about the student and educational decisions that must be made. That is, it is not enough that the claims concern what students know and can do; they must be organized in ways that help teachers improve what they know and can do. A highly reliable and robust test built around the wrong variables will not be useful to end users and consequently will fall out of use.
Another key difference between a single score test and a profile score test is that we must specify how each task outcome depends on the proficiency variables. In a profile score assessment, for each task outcome, we must answer the questions “What proficiencies are required?”; “How are they related in these requirements?”; and “To what degree are they involved?” This is the key to making the various proficiency variables identifiable. In a single score assessment, each item outcome loads only onto the main variable; the only question is with what strength. Consequently, assessment procedures that are tuned to work with single score assessments will not provide all of the information necessary to build a profile score test.
Suppes (1969) introduced a compact representation of the relationship between proficiency and outcome variables for a diagnostic test called the Q-Matrix. In the Q-Matrix, columns represent proficiency variables and rows represent items (observable outcome variables). A one is placed in the cells where the proficiency is required for the item, and a zero is placed in the other cells. Note that an alternative way to represent graphs is through a matrix with ones where an edge is present and zeros where there is no edge. Thus, there is a close connection between Bayesian network models and other diagnostic models which use the Q-Matrix (Tatsuoka 1983; Junker and Sijtsma 2001; Roussos et al. 2007b).
The situation can become even more complicated if the assessment includes complex constructed response tasks. In this case, several aspects of a student’s work can provide evidence of different proficiencies. Consequently, a task may have multiple observable outcomes. For example, a rater could score an essay on how well the candidate observed the rules of grammar and usage, how well the candidate addressed the topic, and how well the candidate structured the argument. These three outcomes would each draw upon different subsets of the collection of proficiencies measured by the assessment.
Some of the hardest work in assessment with complex constructed response tasks goes into defining the scored outcome variables. Don Melnick, who for
several years led the National Board of Medical Examiners (NBME) project
on computer-based case management problems, observed: “The NBME has consistently found the challenges in the development of innovative testing methods to lie primarily in the scoring arena. Complex test stimuli result in complex responses which require complex models to capture and appropriately combine information from the test to create a valid score” (Melnick 1996, p. 117).
The best way to do this is to design forward. We do not want to wait for a designer to create marvelous tasks, collect whatever data result, and throw it over the wall for the psychometrician to figure out “how to score it.” The most robust conclusion from the cognitive diagnosis literature is this: Diagnostic statistical modeling is far more effective when applied in conjunction with task design from a cognitive framework that motivates both task construction and model structure, than when applied retrospectively to existing assessments (Leighton and Gierl 2007).
Rather, we start by asking what we can observe that will provide evidence that the examinee has the skill we are looking for. We build situations with features that draw on those skills, and call for the examinee to say, do, or make something that provides evidence about them: work products. We call the key features of this work observable outcome variables, and the rules for computing them, rules of evidence. For example, in a familiar essay test the observable outcomes are the one or more scores assigned by a rater, and the rules of evidence are the rubrics the rater uses to evaluate the essay as to its qualities.
A richer example is HYDRIVE (Gitomer et al. 1995), an intelligent tutoring system built for the US Air Force and designed to teach troubleshooting for the hydraulics systems of the F-15 aircraft. An expert/novice study of hydraulics mechanics revealed that experts drew on a number of troubleshooting strategies that they could bring to bear on problems (Steinberg and Gitomer 1996). For example, they might employ a test to determine whether the problem was in the beginning or end of a series of components that all had to work for a flap to move when a lever was pulled. This strategy is called “space splitting” because it splits the problem space into two parts (Newell and Simon 1972). HYDRIVE was designed to capture information not only about whether or not the mechanic correctly identified and repaired the problem, but also about the degree to which the mechanic employed efficient strategies to solve the problem. Both of these were important observable outcomes.
However, when there are multiple aspects of proficiency and tasks can have multiple outcomes, the problem of determining the relationships between proficiencies and observable variables becomes even harder. In HYDRIVE, both knowledge of general troubleshooting strategies and knowledge of the specific system being repaired were necessary to solve most problems. Thus each task entailed a many-to-many mapping between observable outcomes and proficiency variables.
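Such a many-to-many structure can be sketched as a simple mapping from (task, outcome) pairs to the set of proficiencies that inform them. Every task, outcome, and proficiency name below is hypothetical, loosely inspired by the HYDRIVE description above, not taken from the actual system:

```python
# Hypothetical many-to-many mapping: each observable outcome of each
# task depends on one or more proficiency variables.
observable_parents = {
    ("Task1", "problem_repaired"):    {"SystemKnowledge", "StrategicKnowledge"},
    ("Task1", "used_space_split"):    {"StrategicKnowledge"},
    ("Task2", "problem_repaired"):    {"SystemKnowledge", "StrategicKnowledge"},
    ("Task2", "efficient_sequence"):  {"SystemKnowledge", "StrategicKnowledge"},
}

# Invert the mapping to see which observables carry evidence about
# each proficiency variable.
evidence_about: dict[str, set] = {}
for outcome, parents in observable_parents.items():
    for proficiency in parents:
        evidence_about.setdefault(proficiency, set()).add(outcome)
```

In this sketch most observables have several proficiency parents and each proficiency is informed by several observables; it is exactly this many-to-many pattern, rather than the one-column loading of a single score test, that makes identifying the proficiency variables hard.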