Evaluation of the Achievement Levels
for Mathematics and Reading
on the National Assessment
of Educational Progress

Christopher Edley, Jr., and Judith A. Koenig, Editors

Committee on the Evaluation of NAEP Achievement Levels for Mathematics and Reading

Board on Testing and Assessment
Committee on National Statistics

Division of Behavioral and Social Sciences and Education
This activity was supported by Contract No. ED-IES-14-C-0124 from the U.S. Department of Education. Any opinions, findings, conclusions, or recommendations expressed in this publication do not necessarily reflect the views of any organization or agency that provided support for the project.
International Standard Book Number-13: 978-0-309-43817-9
International Standard Book Number-10: 0-309-43817-9
Digital Object Identifier: 10.17226/23409
Additional copies of this report are available for sale from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu
Copyright 2016 by the National Academy of Sciences. All rights reserved.
Printed in the United States of America
Suggested citation: National Academies of Sciences, Engineering, and Medicine (2016). Evaluation of the Achievement Levels for Mathematics and Reading on the National Assessment of Educational Progress. Washington, DC: The National Academies Press.
doi: 10.17226/23409
The National Academy of Sciences was established in 1863 by an Act of Congress, signed by President Lincoln, as a private, nongovernmental institution to advise the nation on issues related to science and technology. Members are elected by their peers for outstanding contributions to research.

The National Academy of Engineering was established in 1964 under the charter of the National Academy of Sciences to bring the practices of engineering to advising the nation. Members are elected by their peers for extraordinary contributions to engineering. Dr. C. D. Mote, Jr., is president.

The National Academy of Medicine (formerly the Institute of Medicine) was established in 1970 under the charter of the National Academy of Sciences to advise the nation on medical and health issues. Members are elected by their peers for distinguished contributions to medicine and health. Dr. Victor J. Dzau is president.

The three Academies work together as the National Academies of Sciences, Engineering, and Medicine to provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions. The Academies also encourage education and research, recognize outstanding contributions to knowledge, and increase public understanding in matters of science, engineering, and medicine.

Learn more about the National Academies of Sciences, Engineering, and Medicine at www.nationalacademies.org.
Reports document the evidence-based consensus of an authoring committee of experts. Reports typically include findings, conclusions, and recommendations based on information gathered by the committee and committee deliberations. Reports are peer reviewed and are approved by the National Academies of Sciences, Engineering, and Medicine.

Proceedings chronicle the presentations and discussions at a workshop, symposium, or other convening event. The statements and opinions contained in proceedings are those of the participants and have not been endorsed by other participants, the planning committee, or the National Academies of Sciences, Engineering, and Medicine.

For information about other products and activities of the Academies, please visit nationalacademies.org/whatwedo.
COMMITTEE ON THE EVALUATION OF NAEP ACHIEVEMENT LEVELS
CHRISTOPHER EDLEY, JR. (Chair), School of Law, University of California, Berkeley
PETER AFFLERBACH, Department of Teaching and Learning, Policy and Leadership, University of Maryland
SYBILLA BECKMANN, Department of Mathematics, University of Georgia
H RUSSELL BERNARD, Institute for Social Science Research, Arizona State
University, and Department of Anthropology, University of Florida
KARLA EGAN, National Center for the Improvement of Educational Assessment,
Dover, NH
DAVID J FRANCIS, Department of Psychology, University of Houston
MARGARET E GOERTZ, Graduate School of Education, University of Pennsylvania
(emerita)
LAURA HAMILTON, RAND Education, RAND Corporation, Pittsburgh, PA
BRIAN JUNKER, Department of Statistics, Carnegie Mellon University
SUZANNE LANE, School of Education, University of Pittsburgh
SHARON J LEWIS, Council of the Great City Schools, Washington, DC (retired)
BERNARD L MADISON, Department of Mathematics, University of Arkansas
SCOTT NORTON, Standards, Assessment, and Accountability, Council of Chief State
School Officers, Washington, DC
SHARON VAUGHN, College of Education, University of Texas at Austin
LAURESS WISE, Human Resources Research Organization, Monterey, CA
JUDITH KOENIG, Study Director
JORDYN WHITE, Program Officer
NATALIE NIELSEN, Acting Board Director (until June 2015)
PATTY MORISON, Acting Board Director (from June 2015)
KELLY ARRINGTON, Senior Program Assistant
BOARD ON TESTING AND ASSESSMENT
DAVID J FRANCIS (Chair), Texas Institute for Measurement, Evaluation, and
Statistics, University of Houston
MARK DYNARSKI, Pemberton Research, LLC, East Windsor, NJ
JOAN HERMAN, National Center for Research on Evaluation, Standards, and Student
Testing, University of California, Los Angeles
SHARON LEWIS, Council of Great City Schools, Washington, DC
BRIAN STECHER, Education Program, RAND Corporation, Santa Monica, CA
JOHN ROBERT WARREN, Department of Sociology, University of Minnesota
NATALIE NIELSEN, Acting Director (until June 2015)
PATTY MORISON, Acting Director (from June 2015)
COMMITTEE ON NATIONAL STATISTICS
LAWRENCE D BROWN (Chair), Department of Statistics, Wharton School,
University of Pennsylvania
JOHN M ABOWD, School of Industrial and Labor Relations, Cornell University
FRANCINE BLAU, Department of Economics, Cornell University
MARY ELLEN BOCK, Department of Statistics, Purdue University
MICHAEL E CHERNEW, Department of Health Care Policy, Harvard Medical
School
DON A DILLMAN, Department of Sociology, Washington State University
CONSTANTINE GATSONIS, Center for Statistical Sciences, Brown University
JAMES S HOUSE, Survey Research Center, Institute for Social Research, University
of Michigan
MICHAEL HOUT, Survey Research Center, University of California, Berkeley
THOMAS L MESENBOURG, U.S Census Bureau (retired)
SUSAN A MURPHY, Department of Statistics, University of Michigan
SARAH M NUSSER, Department of Statistics, Center for Survey Statistics and
Methodology, Iowa State University
COLM A O’MUIRCHEARTAIGH, Harris Graduate School of Public Policy Studies,
University of Chicago
RUTH D PETERSON, Criminal Justice Research Center, Ohio State University
ROBERTO RIGOBON, Sloan School of Management, Massachusetts Institute of
Technology
EDWARD H SHORTLIFFE, Biomedical Informatics, Columbia University and
Arizona State University
CONSTANCE F CITRO, Director
BRIAN HARRIS-KOJETIN, Deputy Director
achievement goals for U.S. students. In response, NAEP adopted standards-based reporting for the subjects and grades it assessed.

Today, some 24 years later, the focus on achievement standards is even more intense. At its best, standards-based reporting can provide a quick way to summarize students' achievement and track their progress. It can clearly demarcate the disparities between what we expect our students to know and be able to do, as articulated in the standards, and what they actually know and can do. It can stimulate policy conversations about educational achievement across the country, identifying areas and groups with high performance as well as those with troubling disparities. It can inform policy interventions and reform measures to improve student learning.

There are potential downsides, however. Standards-based reporting can lead to erroneous interpretations. It can overstate or understate progress, particularly when the goal is to monitor the performance of subgroups. In its attempt to be easily understood by all audiences, it can lead to oversimplifications, such as when users do not do the necessary work to ensure their understandings are correct. At a time when policy makers are vitally interested in ensuring that all of the country's students achieve high standards, it is critical that test results are reported in a way that leads to accurate and valid interpretations and that the standards used in reporting deserve the confidence of policymakers, practitioners, and the public.

The U.S. Department of Education turned to the National Academies of Sciences, Engineering, and Medicine to answer these questions, and the Academies set up the Committee on the Evaluation of NAEP Achievement Levels for Mathematics and Reading. The purpose of the project was to evaluate the extent to which NAEP's standards (or achievement levels) are reliable and valid. Are they reasonable? Are they informative to the public? Do they lead to appropriate interpretations? The committee of 15 brought a broad range of education and assessment experience. Our consensus findings, conclusions, and recommendations are documented in this report.

The committee benefited from the work of many others, and we wish to thank the many individuals who assisted us. We first thank the sponsor who supported this work, the U.S. Department of Education, and the staff with the department's Institute of Education Sciences who oversaw our work, Jonathon Jacobson and Audrey Pendleton.

During the course of its work, the committee met three times in person and four times by video conference. The committee's first meeting included a public session
designed to learn more about the sponsor's goals for the project and the history of achievement level setting for the National Assessment of Educational Progress (NAEP). We heard from officials from the National Assessment Governing Board (NAGB) and the National Center for Education Statistics (NCES). We especially thank Cornelia Orr, former Executive Director, and Sharyn Rosenberg, Assistant Director for Psychometrics, with NAGB, and Peggy Carr, Acting Commissioner of NCES, for the vast amount of historical information they provided.

The committee's second meeting included a public forum designed to provide an opportunity for committee members to hear first-hand accounts of the variety of ways NAEP users interpret and use the achievement levels. These discussions were enormously enlightening to the committee, and we thank all who participated: Patte Barth, Sonja Brookins Santelises, Sarah Butrymowicz, Michael Casserly, Enis Dogan, Wendy Geiger, Catherine Gewertz, Renee Jackson, Scott Jenkins, Mike Kane, Jacqueline King, Lyndsey Layton, Nathan Olson, Emily Richmond, Bob Rothman, Lorrie Shepard, and Dara Zeehandlaar. The agenda in Appendix A shows their titles and affiliations.

The committee gratefully acknowledges the dedicated effort provided by the staff of the Board on Testing and Assessment (BOTA) and the Committee on National Statistics (CNSTAT) who worked directly on this project. We first thank Natalie Nielsen, former acting director of BOTA, for her extensive efforts to make this project a reality and her dedication in ensuring its success. We thank Patricia Morison, current acting director of BOTA, for her support and guidance at key stages in this project. We are grateful to Connie Citro, director of CNSTAT, and Robert Hauser, executive director of DBASSE, for their sage advice through the course of this project. We thank Kelly Arrington, senior program assistant, for her exceptional organizational skills and her close attention to detail. Kelly handled all of the administrative details associated with the in-person and virtual meetings and the public forum, and she provided critical support in preparing the manuscript. We are grateful to Jordyn White, program officer with CNSTAT, for her adept research skills and her expertise in assimilating data and designing understandable presentations. We especially thank Judy Koenig for her intellectual and organizational skills as the study director and for her work in assembling the critical information needed to prepare this report. Judy worked tirelessly to keep us on task and to ensure that we met the multiple challenges presented by our charge.

We also thank members of the Office of Reports and Communication of the Division of Behavioral and Social Sciences and Education for their dedicated work on this report. We are indebted to Eugenia Grohman for her sage editorial advice on numerous versions of this manuscript. We thank Kirsten Sampson Snyder for her work in coordinating a very intense review process and Yvonne Wise for shepherding the manuscript through production.

This report has been reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise, in accordance with procedures approved by the Report Review Committee of the National Academies of Sciences, Engineering, and Medicine. The purpose of this independent review is to provide candid and critical comments that will assist the institution in making its published report as sound as possible and to ensure that the report meets institutional standards for objectivity,
evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process.

We thank the following individuals for their review of this report: Lisa M. Abrams, Department of Foundations of Education, School of Education, Virginia Commonwealth University; Brian E. Clauser, Educational Assessment, National Board of Medical Examiners; Sarah W. Freedman, Graduate School of Education, University of California, Berkeley; Roger Howe, Department of Mathematics, Yale University; Nonie K. Lesaux, Graduate School of Education, Harvard University; Joseph A. Martineau, Senior Associate, Center for Assessment, Dover, NH; William Penuel, School of Education, University of Colorado Boulder; Barbara Plake, Buros Institute of Mental Measurements, Emeritus, University of Nebraska-Lincoln; Mark D. Reckase, Department of Counseling, Educational Psychology and Special Education, Michigan State University; Lorrie A. Shepard, School of Education, University of Colorado Boulder; and Dale Whittington, Research and Accountability, Shaker Heights City School District, Shaker Heights, OH.

Although the reviewers listed above provided many constructive comments and suggestions, they were not asked to endorse the conclusions or recommendations, nor did they see the final draft of the report before its release. The review of this report was overseen by Edward H. Haertel, School of Education, Stanford University, and Ronald S. Brookmeyer, Department of Biostatistics, University of California, Los Angeles. Appointed by the National Research Council, they were responsible for making certain that an independent examination of this report was carried out in accordance with institutional procedures and that all review comments were carefully considered. Responsibility for the final content of this report rests entirely with the authoring committee and the institution.

As chair of the committee, I thank my fellow panel members for their extraordinary efforts. They brought a broad range of expertise related to assessment, education policy, equity in education, mathematics and reading, program evaluation, social science, and statistics, all of which was critical to the multifaceted issues that had to be addressed in this evaluation. The panel members freely contributed their time to accomplishing the myriad tasks associated with assembling information and preparing this report. They actively assisted in all stages of this project, including planning the meetings and the public forum, as well as writing and rewriting multiple versions of this report. They gave generously of their time to ensure that the final product accurately represents our consensus findings, conclusions, and recommendations. Their contributions during the period in which the report was in final preparation and after the external review, when sections of the report had to be turned around on a very truncated schedule, are especially appreciated. These efforts manifested the panel members' deep dedication to improving student learning across the country.
Christopher Edley, Jr., Chair
Committee on the Evaluation of NAEP Achievement Levels for Mathematics and Reading
Dedication
During the course of this project, the measurement community lost a towering figure. Robert Linn, distinguished professor emeritus of education at the University of Colorado Boulder, worked at the intersection of education measurement and education policy. He was known for his enormous contributions to test equating, fairness, validity, performance assessment, and educational accountability, and he was widely respected for his ability to translate theory into practical advice. His contributions to NAEP spanned several decades, and he continued to serve on the NAEP Validity Studies Panel until his final months. His contributions to the NAS were numerous: he served as a board member and chair of BOTA and as a readily sought-after committee member and reviewer for many projects. A prodigious scholar and remarkable intellectual leader, Bob was also unceasingly patient and supportive as a teacher, colleague, and friend. We are indebted to him for his contributions and will miss him. We dedicate this report to his memory.
Contents

Evolution of NAEP Achievement Levels
Evolution of Standard Setting
Evaluations of NAEP's Standard Settings
Setting Cut Scores
Evolution of Achievement-Level Descriptors
Summary

3 Setting Achievement Levels: NAEP Process

5 Validity of the Achievement Levels
Concepts of Validity and Validation
Content-Related Validity Evidence

6 Interpretations and Uses of NAEP Achievement Levels
Relevant Standards and Committee Sources
Intended Users, Interpretations, and Uses
NAEP Guidance for Users
Actual Uses of Achievement Levels
Potential Misinterpretations and Misuses of Achievement Levels
Conclusions
Annex: Bibliography

7 Setting New Standards: Considerations
Options for Handling Change
Mathematics
Reading
Digitally Based Assessment
Weighing the Options

8 Conclusions and Recommendations

A Agenda for Public Forum
B Biographical Sketches of Committee Members and Staff
Summary
The National Assessment of Educational Progress (NAEP) has been providing policy makers, educators, and the public with reports on the academic performance and progress of the nation's students since 1969. The assessment is given periodically in a variety of subjects: mathematics, reading, writing, science, the arts, civics, economics, geography, U.S. history, and technology and engineering literacy. NAEP is often referred to as the nation's report card because it reports on the educational progress of the nation as a whole. The assessment is not given to all students in the country, and scores are not reported for individual students. Instead, the assessment is given to representative samples of students across the United States, and results are reported for the nation and for specific groups of students.

Since 1983, the results have been reported as average scores on a scale ranging from 0 to 500. Until 1992, results were reported on this scale for the nation as a whole and for students grouped by sociodemographic characteristics, such as gender, race and ethnicity, and socioeconomic status. Beginning in 1993, results were reported separately by state and, beginning in 2002, also for some urban school districts.

Over time, there has been growing interest in comparing educational progress across the states. At the same time, there has been increasing interest in having the results reported in a way that policy makers and the public could understand and that could be used to examine students' achievement in relation to high, world-class standards. By 1989, there was considerable support for changes in the way that NAEP results were reported.

In part in response to these interests, the idea of reporting NAEP results using achievement levels was first raised in the late 1980s. The Elementary and Secondary Education Act of 1988, which authorized the formation of the National Assessment Governing Board (NAGB), delegated to NAGB the responsibility of "identifying appropriate achievement goals" (P.L. 100-297, Part C, Section 3403 (6) (A)). The decision to report NAEP results in terms of achievement levels was based on NAGB's interpretation of this legislation.

In a 1990 policy statement, NAGB established three "achievement levels": basic, proficient, and advanced. The NAEP results would henceforth report the percentage of test takers by achievement level. The percentage of test takers who scored below the basic level would also be reported. These new reports would be in addition to summary statistics on the score scale.
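To make the mechanics of this kind of reporting concrete, the sketch below classifies a synthetic distribution of 0-500 scale scores into the four reporting categories and summarizes the percentage of test takers in each. It is an illustration only: the cut-score values and the score distribution are assumptions, not NAEP data or official cut scores.

```python
# Minimal sketch of achievement-level reporting (illustrative only; the cut
# scores and score distribution below are assumptions, not NAEP values).
import numpy as np

CUTS = {"basic": 214, "proficient": 249, "advanced": 282}  # hypothetical cut scores

def classify(score: float) -> str:
    """Map a 0-500 scale score to one of the four reporting categories."""
    if score >= CUTS["advanced"]:
        return "advanced"
    if score >= CUTS["proficient"]:
        return "proficient"
    if score >= CUTS["basic"]:
        return "basic"
    return "below basic"

rng = np.random.default_rng(0)
scores = rng.normal(240, 35, size=10_000).clip(0, 500)  # synthetic scale scores

labels = np.array([classify(s) for s in scores])
for level in ["below basic", "basic", "proficient", "advanced"]:
    print(f"{level:>12}: {100 * np.mean(labels == level):5.1f}%")
# Reports pair these percentages with summary statistics on the score scale.
```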
After a major standard setting process in 1992, NAEP began reporting results in relation to the three achievement levels. However, the use of achievement levels has provoked controversy and disagreement, and evaluators have identified numerous concerns. When NAEP was reauthorized in 1994, Congress stipulated that until an evaluation determined that the achievement levels are reasonable, reliable, valid, and informative to the public, they were to be designated as trial, a provisional status that still remains, 22 years later.

In 2014 the U.S. Department of Education, through its Institute of Education Sciences, sought to reexamine the need for this provisional status and contracted with the National Academy of Sciences to appoint a committee of experts to carry out that examination. The committee's charge was to determine whether the NAEP achievement levels in reading and mathematics are reasonable, reliable, valid, and informative to the public. More specifically, it was to evaluate the student achievement levels used in reporting NAEP results, the procedures for setting those levels, and how they are used (see Chapter 1 for the complete charge).
In addressing its charge, the committee focused on process, outcomes, and uses. That is, we evaluated (1) the process for conducting the standard setting; (2) the technical properties of the outcomes of standard setting (the cut scores and the achievement-level descriptors); and (3) the interpretations and uses of the achievement levels.
SETTING STANDARDS
In developing achievement levels, NAGB first needed to set standards, a process that involves determining "how good is good enough" in relation to one or more criterion measures. For instance, in education one commonly used criterion is how good is good enough to attain an A; in employment settings, it would be the minimum test score needed to become certified or licensed to practice in a given field (whether plumbing or medicine).

To set achievement standards for NAEP, two questions had to be answered: What skills and knowledge do students need in order to be considered basic, proficient, and advanced in each subject area? What scores on the test indicate that a student has attained one of those achievement levels?

All standard setting is based on judgment. For a course grade, it is the judgment of the classroom teacher. For a licensure or certification test, it is the judgment of professionals in the field. For NAEP, it is more complicated. As a measure of achievement for a cross-section of U.S. students, NAEP's achievement levels need to reflect common goals for student learning, despite the fact that students are taught according to curricula that vary across states and districts. To accommodate these differences, NAEP's standards need to reflect a wide spectrum of judgments. Hence, NAGB sought feedback from a wide range of experts and stakeholders in setting the standards: educators, administrators, subject-matter specialists, policy makers, parent groups, and professional organizations, as well as the general public.

Through the standard setting process, NAGB adopted a set of achievement levels for each subject area and grade. The achievement levels include a description of the knowledge and skills necessary to perform at a basic, proficient, and advanced level, as well as the "cut score," the minimum score needed to attain each achievement level.
FINDINGS AND CONCLUSIONS
The Process
In setting standards for the 1992 reading and mathematics assessments, NAGB broke new ground. While standard setting has a long history, its use in the area of educational achievement testing, and its use to set multiple standards for a given assessment, was new. While the Standards for Educational and Psychological Testing in place at the time provided guidance for some aspects of the 1992 standard setting, many of the procedures used were novel and untested in the context of achievement testing for kindergarten through 12th grade (K-12).

In carrying out the process, NAGB sought advice and assistance from many measurement and subject-matter experts, including staff of the standard setting contractor, an advisory group of individuals with extensive standard setting expertise, and NAGB's own advisers. In addition, a panel of members of the National Academy of Education (NAEd) evaluated the work being done.

The NAEd evaluators raised questions about the integrity and validity of the process. Perhaps most importantly, they criticized the achievement-level descriptors, arguing that they were not valid representations of performance at the specified levels. They also criticized the specific method for setting the cut scores, arguing that it was too cognitively complex, thus limiting the validity of the outcome.

In spite of the NAEd evaluators' concerns, NAGB moved forward with achievement-level reporting for the 1992 assessments of mathematics and reading. Since then, NAGB and NCES have sponsored research conferences, sought advice from experts in standard setting, commissioned research, formed standing advisory groups, held training workshops, and published materials on standard setting.
For its review, the committee considered the Standards for Educational and Psychological Testing and guidance available in 1992, along with what is known now. In examining the process, we considered the ways in which panelists were selected and trained, how the method for setting the cut scores was selected and implemented, and how the achievement-level descriptors were developed. Our key findings:

The process for selecting standard-setting panelists was extensive and, in our judgment, likely to have produced a set of panelists that represented a wide array of views and perspectives.

In selecting a cut-score setting method, NAGB and ACT chose one method for the multiple-choice and short-answer questions and another for the extended-response questions. This was novel at the time and is now widely recognized as a best practice.

NAEP's 1992 standard setting represented the first time that formal, written achievement-level descriptions were developed to guide the standard setting panelists. This, too, was novel at the time and is now widely recognized as a best practice.
CONCLUSION 3-1 The procedures used by the National Assessment Governing Board for setting the achievement levels in 1992 are well documented. The documentation includes the kinds of evidence called for in the Standards for Educational and Psychological Testing in place at the time and currently and was in line with the research and knowledge base at the time.
Outcomes
The standard-setting process used for NAEP began with the frameworks (the blueprints for the mathematics and reading assessments), a general policy description of what each level is intended to represent (e.g., mastery over challenging subject matter), and a set of items that have been used to assess the knowledge and skills elaborated in the assessment frameworks. The standard setting process produces two key outcomes. The first outcome is a set of detailed achievement-level descriptors, specifying the knowledge and skills required at each of the achievement levels. The second outcome is the cut score that indicates the minimum scale score value for each achievement level. The achievement levels defined by these cut scores provide the basis for using and interpreting test results, and thus the validity of test score interpretations hinges on the appropriateness of the cut scores.

In evaluating these outcomes, the committee examined evidence of their reliability and validity. In the context of standard setting, reliability is a measure of the consistency, generalizability, and stability of the judgments (i.e., the cut scores). Reliability estimates indicate the extent to which the cut-score judgments are likely to be consistent across replications of the standard setting, such as repeating the standard setting with different panelists, different test questions, on different occasions, or with different methods.
NAGB conducted studies to collect three kinds of reliability evidence: interpanelist agreement; intrapanelist consistency across items of different types; and the stability of cut scores across occasions. The actual values of the estimates of consistency suggest a considerable amount of variability in cut-score judgments. The available documentation notes that this issue received considerable attention, but the sources and the effects of the variability were not fully addressed before achievement-level results were reported. We are hesitant to make judgments about the rationale for decisions made long ago; at the same time, we acknowledge that some of these issues warranted further investigation.
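As a minimal illustration of the kind of consistency indicators described above (not the analyses actually conducted for the 1992 standard settings), the sketch below computes the spread of hypothetical panelists' cut-score judgments and a jackknife standard error of the panel's mean cut score.

```python
# Minimal sketch of cut-score consistency indicators; the panelist judgments
# below are made-up values used only to show the computations.
import numpy as np

judgments = np.array([243., 251., 238., 255., 247., 242., 260., 249., 245., 252.,
                      239., 257., 244., 250., 246., 241., 254., 248., 243., 256.])

panel_cut = judgments.mean()                 # panel-recommended cut score
between_panelist_sd = judgments.std(ddof=1)  # interpanelist variability

# Jackknife standard error of the mean cut score across leave-one-out replications
n = len(judgments)
loo_means = np.array([np.delete(judgments, i).mean() for i in range(n)])
jackknife_se = np.sqrt((n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2))

print(f"panel cut score:      {panel_cut:.1f}")
print(f"between-panelist SD:  {between_panelist_sd:.1f}")
print(f"jackknife SE of mean: {jackknife_se:.2f}")
```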
CONCLUSION 4-1 The available documentation of the 1992 standard settings in reading and mathematics includes the types of reliability analyses called for in the Standards for Educational and Psychological Testing that were in place at the time and those that are currently in place. The evidence that resulted from these analyses, however, showed considerable variability among panelists' cut-score judgments: the expected pattern of decreasing variability among panelists across the rounds was not consistently achieved, and panelists' cut-score estimates were not consistent over different item formats and different levels of item difficulty. These issues were not resolved before achievement-level results were released to the public.
Validation in the context of standard setting usually consists of demonstrating that the proposed cut score for each achievement level corresponds to the achievement-level descriptor and that the achievement levels are set at a reasonable level, neither too low nor too high. Accordingly, studies were conducted to provide evidence of validity related to test content and to relationships with external criteria.

With regard to content-related validity evidence, the studies focused on the alignment among the achievement-level descriptors and cut scores, the frameworks, and the test questions. For these studies, a second and sometimes a third group of panelists was asked to review the achievement-level descriptors and cut scores produced by the initial standard setting. As a result of these reviews, changes were made to the achievement-level descriptors: some were suggested to NAGB by the panelists; others were made by NAGB.
Since 1992, changes have been made to the mathematics and reading frameworks, the assessment tasks, and the achievement level descriptors, most notably in 2005 and 2009. With the exception of grade-12 mathematics in 2005, no changes have been made in the cut scores. Moreover, the grade-12 descriptors for mathematics were changed in 2005 and 2009, but related changes were not made to those for grades 4 and 8. Consequently, the final descriptors were not the ones that panelists used to set the cut scores.
CONCLUSION 5-1 The studies conducted to assess content validity are in line with those called for in the Standards for Educational and Psychological Testing in place in 1992 and currently in 2016. The results of these studies suggested that changes in the achievement-level descriptors were needed, and those changes were subsequently made. The changes may have better aligned the descriptors to the framework and exemplar items, but, as a consequence, the final achievement-level descriptors were not the ones used to set the cut scores. Since 1992, there have been additional changes to the frameworks, the item pools, and the assessments, as well as studies to identify needed revisions to the achievement level descriptors. But, to date, there has been no effort to set new cut scores using the most current achievement level descriptors.1
CONCLUSION 5-2 Changes in the NAEP mathematics frameworks in 2005 led to new achievement level descriptors and a new scale and cut scores for the achievement levels at 12th grade, but not for the 8th and 4th grades. These changes create a perceived or actual break between 12th-grade mathematics and 8th- and 4th-grade mathematics. Such a break is at odds with contemporary thinking in mathematics education, which holds that school mathematics should be coherent across grades.

1 This text was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
Criterion-related validity evidence usually consists of comparisons with other indicators of the content and skills measured by an assessment, in this case, other measures of achievement in reading and mathematics. The goal is to help evaluate the extent to which achievement levels are reasonable and set at an appropriate level.

It can be challenging to identify and collect the kinds of data that are needed to provide evidence of criterion-related validity. The ACT reports that document the validity of the achievement levels do not include results from any studies that compared NAEP achievement levels with external measures. It is not clear why NAGB did not pursue such studies. In contrast, the NAEd reports include a variety of such studies, such as comparisons with state assessments, international assessments, Advanced Placement (AP) tests, and college admissions tests. The NAEd evaluators also conducted a special study in which 4th- and 8th-grade teachers classified their own students into the achievement-level categories by comparing the achievement-level descriptors with the students' classwork. We examined evidence from similar sources for our evaluation, and we consider it to be of high value in judging the reasonableness of the achievement levels.
Our comparisons reveal considerable correspondence between the percentages of students at NAEP achievement levels and the percentages on other assessments. These studies show that the NAEP achievement-level results (the percentage of students at the advanced level) are generally consistent with the percentage of U.S. students scoring at the reading and mathematics benchmarks on the Programme for International Student Assessment (PISA), at the mathematics benchmarks on the Trends in International Mathematics and Science Study (TIMSS), and at the higher levels on Advanced Placement exams. These studies also show that significant numbers of students in other countries score at the equivalent of the NAEP advanced level.
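The sketch below shows the form such a comparison takes: tabulating the percentage of students at or above a NAEP level next to the percentage at or above a benchmark on an external assessment. All of the entries are placeholder values for illustration, not reported results.

```python
# Minimal sketch of a criterion-related comparison table; the percentages are
# placeholders, not results from NAEP, TIMSS, PISA, or AP reports.
comparisons = [
    # (comparison, % at/above NAEP level, % at/above external benchmark)
    ("grade 8 math: NAEP advanced vs. TIMSS advanced benchmark", 8.0, 9.0),
    ("reading: NAEP advanced vs. PISA top proficiency levels",   7.0, 8.0),
]

print(f"{'comparison':<58}{'NAEP %':>8}{'ext. %':>8}{'diff':>7}")
for label, naep_pct, ext_pct in comparisons:
    print(f"{label:<58}{naep_pct:8.1f}{ext_pct:8.1f}{naep_pct - ext_pct:7.1f}")
```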
CONCLUSION 5-3 The Standards for Educational and Psychological Testing in place in 1992 did not explicitly call for criterion-related validity evidence for achievement level setting, but such evidence was routinely examined by testing programs. The National Assessment Governing Board did not report information on criterion-related evidence to evaluate the reasonableness of the cut scores set in 1992. The National Academy of Education evaluators reported four kinds of criterion-related validity evidence, and they concluded that the cut scores were set very high. We were not able to determine if this evidence was considered when the final cut scores were adopted for NAEP.
Recent research has focused on validity evidence based on relationships with external variables, that is, setting benchmarks on NAEP that are related to concurrent or future performance on measures external to NAEP. The findings from this research can be used to evaluate the validity of new interpretations of the existing achievement levels, suggest possible adjustments to the cut scores or descriptors, or enhance understanding and use of the achievement level results. This research can also help establish specific benchmarks that are separate from the existing achievement levels, such as college readiness.
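One common way to establish such a benchmark is an equipercentile-style linking: find the NAEP scale score whose percentile rank, in a linked sample, matches the percentile rank of the benchmark on the external measure. The sketch below illustrates the idea with synthetic data; it is not the methodology of any particular NAEP benchmarking study.

```python
# Minimal equipercentile-style linking sketch (synthetic data; illustrative only).
import numpy as np

rng = np.random.default_rng(1)

# Scores for a linked sample observed on both measures
naep = rng.normal(285, 30, size=8_000)                 # synthetic NAEP scale scores
external = 2.1 * naep + rng.normal(0, 40, size=8_000)  # synthetic, correlated external measure

external_benchmark = 650.0  # hypothetical "prepared" benchmark on the external measure

pct_below = 100 * np.mean(external < external_benchmark)  # percentile rank of the benchmark
naep_benchmark = np.percentile(naep, pct_below)           # NAEP score at the same percentile

print(f"{pct_below:.1f}% of the linked sample scores below the external benchmark")
print(f"equipercentile NAEP benchmark: {naep_benchmark:.0f}")
```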
CONCLUSION 5-4 Since the NAEP achievement levels were set, new research has investigated the relationships between NAEP scores and external measures, such as academic preparedness for college. The findings from this research can be used to evaluate the validity of new interpretations of the existing performance standards, suggest possible adjustments to the cut scores or descriptors, or enhance understanding and use of the achievement level results. This research can also help establish specific benchmarks that are separate from the existing achievement levels. This type of research is critical for adding meaning to the achievement levels.
Interpretation and Use
Originally, NAEP was designed to measure and report what U.S. students actually know and are able to do. However, the achievement levels were designed to lay out what U.S. students should know and be able to do. That is, the adoption of achievement levels added an extra layer of reporting to reflect the nation's aspirations for students. Reporting NAEP results as the percentage of students who scored at each achievement level was intended to make NAEP results more understandable. This type of reporting was designed to clearly and succinctly highlight the extent to which U.S. students are meeting expectations.

The committee was unable to find any official documents that provide guidance on the intended interpretations and uses of NAEP achievement levels, beyond the brief statements in two policy documents. The committee was also unable to find documents that specifically lay out appropriate uses and the associated research to support those uses. We found a disconnect between the kind of validity evidence that has been collected and the kinds of interpretations and uses that are made of NAEP's reported results. That is, although the committee found evidence for the integrity and accuracy of the procedures used to set the achievement levels, the evidence does not extend to the uses of the achievement levels: the ways that NAEP audiences use the results and the decisions they base on them.

The committee found that considerable information is provided to state and district personnel and the media in preparation for a release of NAEP results, and NAGB provided us with examples of these materials. However, this type of information was not easy to find on the NAEP website.

The many audiences for NAEP achievement levels use them in a variety of ways, including to inform public discourse and policy decisions, as was the original intention. However, the interpretive guidance provided to users is inconsistent and fragmented. Some audiences receive considerable guidance just prior to a release of results. For audiences that obtain most of their information from the website or from hard-copy reports for the general public, interpretive guidance is hard to locate. Without appropriate guidance, misuses are likely. The committee found numerous types of inappropriate inferences.
CONCLUSION 6-1 NAEP achievement levels are widely disseminated to and used by many audiences, but the interpretive guidance about the meaning and appropriate uses of those levels provided to users is inconsistent and piecemeal. Without appropriate guidance, misuses are likely.
CONCLUSION 6-2 Insufficient information is available about the intended interpretations and uses of the achievement levels and the validity evidence that supports these interpretations and uses. There is also insufficient information on the actual interpretations and uses commonly made by NAEP's various audiences and little evidence to evaluate the validity of any of them.

CONCLUSION 6-3 The current achievement-level descriptors may not provide users with enough information about what students at a given level know and can do. The descriptors do not clearly provide accurate and specific information about what students at the cut score for each level know and can do.

The committee recognizes that the achievement levels are a well-established part of NAEP, with wide influence on state K-12 achievement tests. Making changes to something that has been in place for over 24 years would likely have a range of consequences that cannot be anticipated. We also recognize the difficulties that might be created by setting new standards, particularly the disruptions that would result from breaking people's interpretations of the trends. We also note that during their 24 years the achievement levels have acquired meaning for NAEP's various audiences and stakeholders: they serve as stable benchmarks for monitoring achievement trends, and they are widely used to inform public discourse and policy decisions. Users regard them as a regular, permanent feature of NAEP reports.

To date, the descriptors for grade 12 mathematics and grades 4, 8, and 12 reading have been revised and updated as recently as 2009, but no changes have been made to the descriptors for mathematics in grades 4 and 8 since 2004.

We considered several courses of action, ranging from recommending no changes to recommending a new standard setting. We concluded that most of the strongest criticisms of the current standards, and the argument for completely new standards, can be addressed instead by revising the achievement-level descriptors.
CONCLUSION 7-1 The cut scores for grades 4 and 8 in mathematics and for all grades in reading were set more than 24 years ago. Since then, there have been many adjustments to the frameworks, item pools, assessments, and achievement level descriptors, but there has been no effort to set new cut scores for these assessments. While priority has been given to maintaining the trend lines, it is possible that there has been "drift" in the meaning of the cut scores such that the validity of inferences about trends is questionable. The situation for grade 12 mathematics is similar, although possibly to a lesser extent, since the cut scores were set more recently (in 2005) and, thus far, only one round of adjustments has been made (in 2009).2

CONCLUSION 7-2 Although there is evidence to support conducting a new standard setting at this time for all grades in reading and mathematics, setting new cut scores would disrupt the NAEP trend line at a time when many other contextual factors are changing. In the short term, the disruption in the trend line could be avoided by continuing to use the same cut scores while ensuring that the descriptors are aligned with them. In particular, work is needed to ensure that the mathematics achievement-level descriptors for grades 4 and 8 are well aligned with the framework, cut scores, and item pools. Additional work to evaluate the alignment of the items and the achievement level descriptors for grade 4 reading and grade 12 mathematics is also needed. This work should not be done piecemeal, one grade at a time; rather, it should be done in a way that maintains the continuum of skills and knowledge across grades.3

2 This conclusion was added after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
RECOMMENDATIONS
The panel's charge included the question of reasonableness. NAEP and its achievement levels loom large in public understanding of critical debates about education, excellence, and opportunity. One can fairly argue that the "nation's report card" is a success for that reason alone. Through 25 years of use, the NAEP achievement levels have acquired a "use validity," or reasonableness, by virtue of familiarity.

In the long term, we recommend a thorough revision of the achievement-level descriptions, informed by a suite of education, social, and economic outcomes important to key audiences. We envision a set of descriptions that correspond to a few salient outcomes, such as college readiness or international comparisons. The studies we recommend, however, would also offer ways to characterize other scale score points. This information should be available to the public along with test item exemplars. The more audiences understand the scale scores, the less likely they are to misuse the achievement levels.

Setting new cut scores at this time, when so many things are in flux, would likely create considerable confusion about their meaning. We do not encourage a new standard setting at this time, but we note that at some point the balance of concerns will tip in favor of a new standard setting. There will be evolution in the methodology, the assessment frameworks, the technology of test administration and hence the nature of items, and more. We suggest that the U.S. Department of Education state an intention to revisit this issue in a stated number of years. We offer specific recommendations below.
RECOMMENDATION 1 Alignment among the frameworks, the item pools, the achievement-level descriptors, and the cut scores is fundamental to the validity of inferences about student achievement. In 2009, alignment was evaluated for all grades in reading and for grade 12 in mathematics, and changes were made to the achievement-level descriptors as needed. Similar research is needed to evaluate alignment for the grade 4 and grade 8 mathematics assessments and to revise the descriptors as needed to ensure that they represent the knowledge and skills of students at each achievement level. Moreover, additional work to verify alignment for grade 4 reading and grade 12 mathematics is needed.4

3 This conclusion was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
RECOMMENDATION 2 Once satisfactory alignment among the frameworks, the item pools, the achievement-level descriptors, and the cut scores in NAEP mathematics and reading has been demonstrated, the achievement levels' designation as trial should be discontinued. This work should be completed and the results evaluated as stipulated by law5 (20 U.S. Code 9622: National Assessment of Educational Progress, https://www.law.cornell.edu/uscode/text/20/9622 [September 2016]).

RECOMMENDATION 3 To maintain the validity and usefulness of the achievement levels, there should be regular, recurring reviews of the achievement-level descriptors, with updates as needed, to ensure that they reflect both the frameworks and the incorporation of those frameworks in NAEP assessments.6

RECOMMENDATION 4 Research is needed on the relationships between the NAEP achievement levels and concurrent or future performance on measures external to NAEP. Like the research that led to setting scale scores that represent academic preparedness for college, new research should focus on other measures of future performance, such as being on track for a college-ready high school diploma for 8th-grade students and readiness for middle school for 4th-grade students.7

RECOMMENDATION 5 Research is needed to articulate the intended interpretations and uses of the achievement levels and to collect validity evidence to support these interpretations and uses. In addition, research is needed to identify the actual interpretations and uses commonly made by NAEP's various audiences and to evaluate the validity of each of them. This information should be communicated to users with clear guidance on substantiated and unsubstantiated interpretations.8
RECOMMENDATION 6 Guidance is needed to help users determine inferences that are best made with achievement levels and those best made with scale score statistics. Such guidance should be incorporated in every report that includes achievement levels.9

4 Recommendation 1 was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
5 This recommendation was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
6 This recommendation was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
7 This recommendation was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
8 This recommendation was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
A number of aspects of the NAEP reading and mathematics assessments have changed since 1992: the constructs and frameworks; the types of items, including more constructed-response questions; the ways of reporting results; and the addition of innovative web-based data tools. NAEP data have also been used in new ways over the past 24 years, such as reporting results for urban districts, including NAEP in federal accountability provisions, and setting academic preparedness scores. New linking studies have made it possible to interpret NAEP results in terms of the results of international assessments, and there are possibilities for linking NAEP 4th- and 8th-grade results to indicators of being on track for future learning. Although external to NAEP, major national initiatives have significantly altered state standards in reading and mathematics.

These and other factors imply a changing context for NAEP. Staying current with contemporary practices and issues while also maintaining the trend line for NAEP results are competing goals.
RECOMMENDATION 7 NAEP should implement a regular cycle for considering the desirability of conducting a new standard setting. Factors to consider include, but are not limited to: substantive changes in the constructs, item types, or frameworks; innovations in the modality for administering assessments; advances in standard setting methodologies; and changes in the policy environment for using NAEP results. These factors should be weighed against the downsides of interrupting the trend data and information.10

9 This recommendation was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
10 This recommendation was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
1 Introduction
BACKGROUND
Since 1969 the National Assessment of Educational Progress (NAEP) has been providing policy makers, educators, and the public with reports on the academic performance and progress of the nation's students. The assessment is given periodically in a variety of subjects: mathematics, reading, writing, science, the arts, civics, economics, geography, U.S. history, and technology and engineering literacy. NAEP is often referred to as the Nation's Report Card because it reports on the educational progress of the nation as a whole. The assessment is not given to all students in the country, and scores are not reported for individual students. Instead, the assessment is given to representative samples of students across the country, and scores are reported only for groups of students. During the first decade of NAEP, results were reported for the nation as a whole and for students grouped by social and demographic characteristics, including gender, race and ethnicity, and socioeconomic status.

Over time, there was a growing desire to compare educational progress across the states. At the same time, there was an increasing desire to report the results in ways that policy makers and the public could better understand and to examine students' achievement in relation to world-class standards. By 1988, there was considerable support for these changes.

The Elementary and Secondary Education Act of 1988 authorized the formation of the National Assessment Governing Board (NAGB) and gave it responsibility for setting policy for NAEP and "identifying appropriate achievement goals" (Public Law 100-297, Part C, Section 3403 (6) (A)). In part in response to this legislation, NAGB decided to report NAEP results in terms of achievement levels, and that reporting began with the 1992 assessments.

Three achievement levels are used: "basic," "proficient," and "advanced." Results are reported according to the percentage of test takers whose scores are at each achievement level, and brief descriptions of the levels are provided with the results.1 The percentage of test takers who score below the basic level also is reported; Figures 1-2 through 1-7 in the annex to this chapter show an example of this type of reporting.
The purpose of setting achievement levels for NAEP was explicitly stated (National Assessment Governing Board, 1993, p. 1):

[to make it] unmistakably clear to all readers and users of NAEP data that these are expectations which stipulate what students should know and should be able to do at each grade level and in each content area measured by NAEP. The achievement levels make the NAEP data more understandable to general users, parents, policymakers and educators alike. They are an effort to make NAEP part of the vocabulary of the general public.

1 We use the words "description" and "descriptor" interchangeably in discussing achievement levels, as is done in the field.
In the ensuing two-and-a-half decades, the use of achievement levels has become a fixture of the reporting of test results, not only for NAEP, but also for a number of other testing programs. Notably, the No Child Left Behind Act of 2001 required that all states set achievement levels for their state tests, and many used the same names for achievement levels as those used by NAEP.

Given the potential value to the nation of setting achievement levels for NAEP and the importance of "getting it right," the procedures and results have received considerable scrutiny. Before achievement-level results were reported for the 1992 mathematics and reading assessments, numerous evaluations were conducted, including Stufflebeam et al. (1991), Linn et al. (1992a, 1992b), and Koretz and Deibert (1993), which focused on the 1990 mathematics results; the U.S. General Accounting Office (1993), which focused on the 1992 mathematics results; and Shepard et al. (1993), which covered the 1992 mathematics and reading results.

These reviews raised numerous concerns about the ways that NAEP's achievement levels had been developed and the extent to which they would support the intended interpretations. The reviews generally concluded that (1) the achievement levels were not necessarily valid indicators of educators' judgments of what students should know and be able to do and (2) the achievement-level descriptors were not accurate representations of the NAEP assessments and frameworks.

When NAEP was up for reauthorization in 1994, Congress stipulated that the achievement levels be used only on a "trial basis until the Commissioner [of the National Center for Education Statistics] determines, as a result of an evaluation, that such levels are reasonable, valid, and informative to the public" (P.L. 107-110 (2002), Sec. 602(e)). Since that time, achievement-level reports have carried a footnote indicating that they are trial, a provisional status that has remained for 25 years.

In 2014 the U.S. Department of Education sought to reexamine the need for the provisional status of the NAEP achievement levels, and it requested a study under the auspices of the National Academies of Sciences, Engineering, and Medicine to conduct this examination. The work was carried out under two standing activities in the Division of Behavioral and Social Sciences and Education: the Board on Testing and Assessment and the Committee on National Statistics. Together, the two boards established the Committee on the Evaluation of NAEP Achievement Levels in Reading and Math. The 15 committee members, who served as volunteers, had a broad range of expertise related to assessment, education policy, mathematics and reading, program evaluation, social science, and statistics.
The following statement of task guided the committee’s work:
An ad hoc committee will conduct an evaluation of the student achievement levels that are used in reporting results of the NAEP assessments in reading and
mathematics in grades 4, 8, and 12 to determine whether the achievement levels are reasonable, reliable, valid, and informative to the public. The committee will review the achievement-level-setting procedures used by the National Assessment Governing Board and the National Center for Education Statistics, the ways that the achievement levels are used in NAEP reporting, the interpretations made of them and the validity of those interpretations, and the research literatures related to setting achievement levels and reporting test results to the public. The committee will write a final report that describes its findings about the achievement levels and how the levels are used in NAEP reporting. If warranted, the committee's report will also provide recommendations about ways that the setting and use of achievement levels for NAEP can be improved.
To address its charge, the committee held three in-person meetings and four half-day virtual meetings during 2015. Before discussing our approach to the study, we provide some background on the process for developing achievement levels, or, more generally, standard setting, and on the key features of NAEP.
STANDARD SETTING
Translating NAEP scale scores into achievement levels involves a process
referred to as standard setting, in which different levels of performance are identified and described, and the test scores that distinguish between the levels are determined. Box 1-1 defines the terms often used with standard setting.
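To make the mechanics concrete, here is a minimal sketch of how cut scores partition a score scale into achievement levels. The cut-score values and level labels are hypothetical placeholders chosen for illustration, not NAEP's actual cut scores.

```python
# Illustrative only: hypothetical cut scores and level names, not NAEP's actual values.
from bisect import bisect_right

CUT_SCORES = [210, 250, 290]   # hypothetical lower bounds for Basic, Proficient, Advanced
LEVELS = ["Below Basic", "Basic", "Proficient", "Advanced"]

def classify(scale_score: float) -> str:
    """Return the achievement level whose score range contains scale_score."""
    return LEVELS[bisect_right(CUT_SCORES, scale_score)]

for score in (195, 230, 265, 300):
    print(score, "->", classify(score))
```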
Standard setting involves determining "how good is good enough" in relation to one or more achievement or performance measures. For instance, educators use it regularly to assign grades (e.g., how good is good enough for an A or a B). In employment settings, it has long been used to determine the minimum test score needed to become certified or licensed to practice in a given field: examples include medical board tests for doctors, licensing tests for nurses, bar examinations for lawyers, and certification examinations for accountants. In these education and employment examples,
a concrete criterion can be stated: What material should a student know and be able to do
to receive an A in a course? What should a prospective practitioner know and be able to
do to practice safely and effectively in a given field?
All standard setting is based on judgment. For a course grade, it is the judgment of the classroom teacher. For a licensure or certification test, it is the judgment of professionals who work in the field. For NAEP, it is more complicated. As a measure of achievement for a cross-section of U.S. students, NAEP's achievement levels need to reflect common goals for student learning—despite the fact that students are taught according to curricula that vary across states and districts. To accommodate this variation, NAEP's standards needed to reflect a wide spectrum of judgments about what achievement was intended by those who chose the curricula.
For large-scale tests like NAEP, standard setting is a formal process, with
guidelines for how it should be done. Generally, the process involves identifying a set of individuals with expertise in the relevant areas, recruiting them to serve as standard-setting panelists, training them to perform the standard-setting tasks, and implementing the procedures to obtain their judgments. There are two outcomes of the process: (1) a detailed description of what students at each achievement level should know and be able to do; and (2) the cut score that marks the minimum score needed to be placed at a given achievement level.
There are many methods for conducting standard settings. They differ in the nature of the tasks that panelists are asked to do and the types of judgments they are asked to make. There is no single "right" method; the choice of method is based on the kinds of question formats (e.g., multiple choice, short answer, extended response), various characteristics of the assessment and its uses, and often the experiential base of those conducting the standard setting. Regardless of the method chosen, the most critical issue is that the process is carried out carefully and systematically, following accepted procedures.
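As one concrete illustration of how panelist judgments can become a cut score, the sketch below works through the arithmetic of a simple Angoff-style approach, one widely used family of standard-setting methods; it is not presented as the exact procedure used for NAEP, and all ratings and item counts are hypothetical. Each panelist estimates, for every item, the probability that a student just at the borderline of a level would answer it correctly; the sum of those probabilities is that panelist's recommended cut score, and the panel's recommendations are then averaged.

```python
# Hypothetical Angoff-style computation; panelists, items, and ratings are invented.
from statistics import mean

# Each row: one panelist's estimated probabilities that a "borderline Proficient"
# student would answer each of five items correctly.
panelist_ratings = [
    [0.80, 0.65, 0.55, 0.70, 0.40],
    [0.75, 0.60, 0.50, 0.65, 0.45],
    [0.85, 0.70, 0.60, 0.75, 0.50],
]

# A panelist's recommended raw cut score is the sum of his or her item ratings,
# i.e., the expected number-correct score for a borderline student.
recommended = [round(sum(ratings), 2) for ratings in panelist_ratings]

# The panel's cut score is typically a central tendency of the recommendations,
# which would then be mapped onto the reporting scale.
raw_cut = mean(recommended)
print("Per-panelist recommendations:", recommended)  # [3.1, 2.95, 3.4]
print("Panel raw cut score:", round(raw_cut, 2))     # 3.15 expected correct out of 5
```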
The achievement-level descriptions and corresponding cut scores are intended to reflect performance as defined through a subject-area framework (see below). Each assessment is built around an organizing framework, which is the blueprint that guides the development of the assessment instrument and determines the content to be assessed.
For NAEP, the frameworks capture a range of subject-specific content and
thinking skills needed by students to deal with what they encounter, both inside and outside their classrooms. The NAEP frameworks are determined through a development process designed to ensure that they are appropriate for current educational requirements. Because the assessments must remain flexible to mirror changes in educational objectives and curricula, the frameworks are designed to be forward-looking and responsive,
balancing current teaching practices with research findings.2
There is an assumed relationship between a subject-area framework and other elements of the assessment. First, given that the stated purpose of a framework is to guide the development of items for the assessment, there must be a close correspondence between a framework and the assessment items.3
Second, given that NAGB intends the achievement levels and the frameworks to remain stable over several test administrations (while actual items will change with each administration), the frameworks must serve as the pivotal link between the assessments and the achievement levels—both as they are described narratively and as they are
operationalized into item-level judgments and, ultimately, cut scores. Ideally, this would mean that achievement-level descriptors are crafted concurrently with the framework (see Shepard et al., 1993, p. 47; also see Bejar et al., 2008).
In principle, then, the achievement-level descriptions should guide test
development so that the tests are well aligned to the intended constructs (concepts or characteristics) of interest. The achievement-level descriptors guide standard setting so that panelists can operationalize them in terms of cut scores with the same
2 For NAEP's description of the frameworks, see http://nces.ed.gov/nationsreportcard/frameworks.aspx [January 2016]. Item specifications for reading can be found at https://www.nagb.org/publications/frameworks/reading/2009-reading-specification.html [August 2016]. Item specifications for mathematics can be found at https://www.nagb.org/publications/frameworks/mathematics/2009-mathematics-specification.html [August 2016].
3 For each assessment, the framework is supplemented with a set of Assessment Specifications that provide additional details about developing and assembling a form of the test.
conceptualization used by item writers. To the extent that tests are well aligned to the achievement-level descriptors, the latter reflect the knowledge and skills possessed by the students at or above each cut score. Therefore, the descriptors used in score reporting actually represent the observed skills of students in a particular achievement-level category (Egan et al., 2012; Reckase, 2001). Figure 1-1 shows the assumed relationships.
KEY FEATURES OF NAEP
NAEP consists of two different programs: main NAEP and long-term trend NAEP. Main NAEP is adjusted as needed so that it reflects current thinking about content areas, assessment strategies, and policy priorities, but efforts are made to incorporate these changes in ways that do not disrupt the trend line. Main NAEP is the subject of this report. Long-term trend NAEP, in contrast, provides a way to examine achievement trends with a measure that does not change; these assessments do not incorporate advances in content areas or assessment strategies. Both main and long-term trend NAEP differ fundamentally from other testing programs in that their objective is to obtain accurate measures of academic achievement for groups of students rather than for individuals. This goal is achieved using innovative sampling, scaling, and analytic procedures.
Sampling of Students4

With the introduction of the TUDA (Trial Urban District Assessment), a sampling plan was created for each participating urban district. At that point, for the states with TUDAs, the state sample was augmented by the TUDA sample. However, the two could be separated for analysis purposes. The national samples for NAEP are selected using stratified multistage sampling designs with three stages of selection: districts, schools, and students. The result is a sample of about 150,000 students sampled from 2,000 schools. The sampling design for state NAEP has only two stages of selection: schools and students within schools. The results are samples of approximately 3,000 students in 100 schools per state (roughly 100,000 students in 4,000 schools nationwide).
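A rough sketch of the two-stage selection described for state NAEP (schools first, then students within schools) is shown below. The school frame, sample sizes, and use of simple random sampling at each stage are simplifying assumptions made for illustration; the actual NAEP design stratifies schools and selects them with probability proportional to size.

```python
# Simplified two-stage sample: schools first, then students within schools.
# The actual NAEP design stratifies schools and samples with probability
# proportional to size; simple random sampling is used here only to show the stages.
import random

random.seed(1)

# Hypothetical frame: 400 schools, each with a roster of student IDs.
frame = {f"school_{i}": [f"s{i}_{j}" for j in range(random.randint(40, 400))]
         for i in range(400)}

N_SCHOOLS = 100            # first stage: sample schools
STUDENTS_PER_SCHOOL = 30   # second stage: sample students within each school

sampled_schools = random.sample(list(frame), N_SCHOOLS)
sample = {school: random.sample(frame[school],
                                min(STUDENTS_PER_SCHOOL, len(frame[school])))
          for school in sampled_schools}

print(sum(len(v) for v in sample.values()), "students sampled from", len(sample), "schools")
```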
For the national assessment in 1992, approximately 26,000 4th-, 8th-, and 12th-grade students in 1,500 public and private schools across the country participated. For jurisdictions that participated in the separate state program, approximately 2,500 students were sampled from approximately 100 public schools for each grade and curriculum area. Thus, a total of approximately 220,000 4th- and 8th-grade students who were attending nearly 9,000 public schools participated in the 1992 trial state assessments. In 1996, between 3,500 and 5,500 students were tested in mathematics and science and between 4,500 and 5,500 were tested in reading and writing (Campbell et al., 1997).

4 This section describes the sampling design for the math and reading assessments only.
Sampling of Items
NAEP assesses a cross-section of the content within a subject-matter area. Due to the large number of content areas and subareas within them, NAEP uses a matrix sampling design to assess student achievement in each subject. Using this design, different blocks of items drawn from the overall content domain are administered to different groups of students: this approach makes it possible to administer a large number and range of items while keeping individual testing time to 1 hour for all subjects. Students receive different but overlapping sets of NAEP items using a form of matrix subsampling known as balanced incomplete block spiraling. This design requires highly complicated analyses and does not permit the performance of a particular student to be accurately measured. Therefore, NAEP reports only group-level results; individual results are not reported.
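The sketch below conveys the flavor of the block-assignment idea: items are grouped into blocks, each booklet contains only a few blocks, and the booklets are arranged so that every pair of blocks appears together in some booklet before being rotated ("spiraled") across students. The seven-block layout and labels are a toy balanced incomplete block design, not the actual NAEP booklet design.

```python
# Toy illustration of matrix sampling with balanced incomplete blocks.
# Seven item blocks (A-G) are arranged into seven booklets of three blocks each
# so that every pair of blocks appears together in exactly one booklet
# (a classic balanced incomplete block design); booklets are then "spiraled"
# across students in rotation. Block labels and counts are hypothetical.
from itertools import combinations

BOOKLETS = [
    ("A", "B", "D"), ("B", "C", "E"), ("C", "D", "F"), ("D", "E", "G"),
    ("E", "F", "A"), ("F", "G", "B"), ("G", "A", "C"),
]

# Check the balance property: each pair of blocks co-occurs exactly once.
pair_counts = {pair: 0 for pair in combinations("ABCDEFG", 2)}
for booklet in BOOKLETS:
    for pair in combinations(sorted(booklet), 2):
        pair_counts[pair] += 1
assert set(pair_counts.values()) == {1}

def assign_booklet(student_index: int) -> tuple:
    """Spiral booklets across students in a fixed rotation."""
    return BOOKLETS[student_index % len(BOOKLETS)]

print([assign_booklet(i) for i in range(3)])
```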
Analytic Procedures
Although individual results are not reported, it is possible to compute estimates of individuals' performance on the overall assessment using complex statistical procedures. The observed data reflect student performance over the particular NAEP blocks each student actually took. Since no individual takes all NAEP blocks, statistical estimation procedures are used to derive estimates of individuals' proficiency on the full complement of skills and content covered by the assessment, on the basis of the test blocks that an individual took.
The procedure involves combining samples of values drawn from distributions of possible proficiency estimates for each student. These individual student distributions are estimated from their responses to the test items and from background variables. The use of background variables in estimating proficiency is called conditioning. For each student, five values, called plausible values, are randomly drawn from the student's distribution of possible proficiency estimates.5 The plausible values are intended to reflect the uncertainty in each student's proficiency estimate, given the limited set of test questions taken by each student. The sampling from the student's distribution is an application of Rubin's (1987) multiple imputation method for handling missing data (the responses to items not presented to the student are considered missing). In the NAEP context, this process is called plausible values methodology (National Research Council, 1999).

5 Beginning with the 2013 analysis, the number of plausible values was increased to twenty.
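As a rough illustration of how plausible values enter an analysis, the sketch below estimates a group mean from five plausible values per student and combines the estimates with Rubin's multiple-imputation rules, so that the between-imputation variance captures measurement uncertainty. The data are simulated, and the within-imputation variance uses a simple random-sample formula; real NAEP analyses also incorporate sampling weights and jackknife variance estimation for the complex sample design.

```python
# Illustrative use of plausible values with Rubin's (1987) combining rules.
# Simulated data; real NAEP analyses also use sampling weights and replicate
# (jackknife) variance estimation for the complex sample design.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_pv = 1000, 5

# Hypothetical plausible values: five draws per student from a posterior
# proficiency distribution (here, true score plus measurement noise).
true_scores = rng.normal(250, 35, size=n_students)
plausible_values = true_scores[:, None] + rng.normal(0, 20, size=(n_students, n_pv))

# Step 1: compute the statistic of interest (a group mean) once per plausible-value set.
means = plausible_values.mean(axis=0)                       # one estimate per plausible value
within = plausible_values.var(axis=0, ddof=1) / n_students  # sampling variance of each estimate

# Step 2: combine with Rubin's rules.
point_estimate = means.mean()
within_var = within.mean()                                  # average within-imputation variance
between_var = means.var(ddof=1)                             # variance across the five estimates
total_var = within_var + (1 + 1 / n_pv) * between_var
print(f"Estimated mean: {point_estimate:.1f}  (SE {np.sqrt(total_var):.2f})")
```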
Statistics Reported
NAEP currently reports student performance on the assessments using a scale that ranges from 0 to 500 for 4th- and 8th-grade mathematics and for all grades for reading. Originally, 12th-grade mathematics results were also reported on this scale; the scale was changed to a range of 0 to 300 when the framework was revised in 2004-2005 (see Chapter 5). Scale scores summarize performance in a given subject area for the nation as
a whole, for individual states, and for subsets of the population defined by demographic and background characteristics. Results are tabulated over time to provide trend information.

As described above, NAEP also reports performance using achievement levels.
NAEP collects a variety of demographic, background, and contextual information from students, teachers, and administrators. Student demographic and academic information includes such characteristics as race and ethnicity, gender, highest level of parental education, and status as a student with disabilities or an English-language learner. Contextual and environmental data provide information about students' course selection, homework habits, use of textbooks and computers, and communication with parents about schoolwork. Information obtained from teachers includes the training they received, the number of years they have taught, and the instructional practices they use. Information obtained from administrators covers their schools, including the location and type of school, school enrollment numbers, and levels of parental involvement. NAEP summarizes achievement results by these various characteristics.6
COMMITTEE APPROACH
The committee held three in-person meetings and four half-day virtual meetings during 2015. The first two meetings included public sessions at which the committee gathered a great deal of information. At the first meeting, officials from NAGB and the National Center for Education Statistics (NCES) described their goals for the project and discussed the types of information available to the committee. At the second meeting, a half-day public forum provided an opportunity for people to talk about how they interpret and use the reported achievement levels. This public forum was organized as five panel discussions, each focused on a type of audience for NAEP results: journalists and education writers; state policy users; developers of the assessments designed to be aligned with the Common Core State Standards; research and advocacy groups; and a synthesis panel with two experts in standard setting (see Appendix A for the forum agenda).
We designed our evaluation accordingly. It is important to point out that the four factors—reasonable, reliable, valid, and informative—are interrelated and cannot be evaluated in isolation from each other. In addition, they are connected by the underlying purpose(s) of achievement-level reporting: that is, what are the intended uses of
6 Results from the most recent tests can be found at
http://nces.ed.gov/nationsreportcard/about/current.aspx [February 2016].
achievement levels? For example, in making judgments about reliability, one needs to consider what types of inferences will be made and the decisions and consequences that will result from those inferences. The same is true for validity: valid for what inferences, what interpretations, and what uses?
Thus, the committee chose not to organize its work and this report around each factor separately. Instead, we organized our review around three types of evidence to support the validity and use of achievement-level reporting: the process of setting standards; the outcomes of the standard setting (the cut scores and the achievement-level descriptors); and the interpretations and uses of achievement levels. We identified a set
of questions to guide our data collection for each factor, and we considered this
information in relation to the Standards for Educational and Psychological Testing
(American Educational Research Association et al.; hereafter, Standards), both the most current version (2014) and the version in use at the time the achievement levels were developed (1985). Box 1-2 lists the questions we posed for our work.
Data Sources
To address its charge, the committee collected and synthesized a wide range of information, including the results of complex statistical analyses, reports of NAEP results, and commentary about them. We used the following types of information:
general (publicly available) background information about NAEP, such as its purpose and intended uses, reading and mathematics frameworks, item pools and test specifications, item types, and reporting mechanisms;
reports of results of NAEP assessments of reading and mathematics and the corresponding interpretation guidance;
interactive reporting tools available on the NAEP websites;
historical documentation of the policy steps involved in creating the
achievement levels for reading and mathematics;
technical documentation of standard setting procedures (setting the cut scores, developing detailed achievement-level descriptors), and associated analyses;
reports from prior external evaluations of the extent to which the achievement levels are reliable, valid, reasonable, and informative to the public;
reports from internal evaluations (generally by or for NAGB) of the extent to which the achievement levels are reliable, valid, reasonable, and informative
to the public, along with responses (and actions taken) to the criticisms made
in the external evaluations;
empirical studies of the reliability, validity, usefulness, and “informativeness”
of the achievement levels and achievement-level reporting, including studies
of the effects of using various procedures for establishing the cut scores;
professional standards with regard to standard setting (e.g., those of the
American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education);
edited volumes and other scholarly literature from well-known experts in the field of standard setting, summarizing research and empirical studies and offering insights and advice on best practice;
subjective accounts about achievement-level reporting, including opinion pieces, commentaries, newspaper columns and articles, blogs, tweets,
conference and workshop presentations, and public forums; and
other reports prepared by research and policy organizations about specific aspects of achievement-level reporting for NAEP.
Our use of these data sources varied. When drawing conclusions from the empirical evidence, we gave precedence to results from analyses published in peer-reviewed journals. However, as is often the case with evaluations, these types of reports were scarce. To compensate for this limitation, the committee placed the greatest weight on evidence that could be corroborated through multiple sources.
There have been numerous literature reviews summarizing the results of these studies and offering advice on best practices. In order to address our charge within the allotted time period, we reviewed several compendia and edited volumes by well-respected experts in the field of standard setting. These documents are well known in the measurement field, the editors and authors are widely respected, and the various chapters cover an array of views. Specifically, we relied on the following documents, listed in chronological order:
a special issue of the Journal of Educational Measurement (National Council on Measurement in Education, 1978)
Jaeger (1989)
a special issue of Educational Measurement: Issues and Practice (Nitko, 1991)
Crocker and Zieky (1995a, 1995b)
Bourque (1997) and Bourque and Byrd (2000)
The Department of Education informed the committee that there were three other reports that had not been provided to us. The reports document studies of the extent to which the item pool and the achievement-level descriptors are aligned and the resulting changes made to improve alignment. One study was conducted in 2003 (Braswell and Haberstroh, 2004); two others were conducted in 2009, when changes were made to the mathematics framework for grade 12 and the reading framework for all grades (4, 8, and 12) (Pitoniak et al., 2010; Donahue et al., 2010).7 Because of this new information, the committee needed additional time to analyze the reports and incorporate conclusions from them in our report. We indicate in Chapters 5, 7, and 8 where the text was changed after the original transmittal to reflect the new information.
7 Copies of these studies can be obtained from the Public Access Records Office by phone (202-334-3543) or email (paro@nas.edu).

The Department of Education also noted several places in the report at which there were small factual errors. Several recommendations were reworded because they misstated the agency responsible for a proposed action. These changes were reviewed in accordance with institutional policy and are noted in the Summary and Chapter 8. We added a brief description of the item-rating process during the standard setting, which is noted in Chapter 3. And, as a result of some of these other changes, we added a new conclusion (7-1), which is noted in the Summary and Chapters 7 and 8.
The late information received from the Department of Education has added to the richness of the information in this report. However, it has not in any way changed the committee's fundamental conclusions and recommendations.
Guide to the Report
The report is organized around our evaluation questions. Chapter 2 provides additional context about the origin of achievement levels for NAEP, the process for determining them, and the evaluations of them. It also discusses changes in the field of standard setting over the past 25 years. Chapter 3 discusses the evidence we collected on the process for setting achievement levels.
Chapters 4 and 5, respectively, consider the evidence of the reliability and validity of the achievement levels. Chapter 6 discusses the intended uses of the achievement levels and the evidence that supports those uses, along with the actual uses of the levels and common misinterpretations.
In Chapter 7 we explore the issues to consider in deciding whether a new standard setting is needed, and we present our recommendations in Chapter 8.
BOX 1-1 Key Terms in Standard Setting
Unless specifically noted, the definitions below are excerpted and adapted from
Standards for Educational and Psychological Testing (American Educational Research
Association, American Psychological Association, and National Council on Measurement
in Education, 2014).
Achievement level (also called performance level, proficiency level, performance
standard): Label or brief statement classifying a test taker's competency in a particular domain, usually defined by a range of scores on a test. For example, labels such as basic to advanced or novice to expert constitute broad ranges for classifying proficiency (pp. 215, 221).
Achievement level descriptor (also called performance level descriptor, proficiency
level descriptor; descriptor and description are often used interchangeably): Qualitative descriptions of test takers' levels of competency in a particular area of knowledge or skill, usually defined in terms of categories ordered on a continuum. The categories constitute broad ranges for classifying test takers' performance (p. 215).
Claim: A statement (inference, interpretation) made about students’ knowledge and skills
based on evidence from test performance (Kane, 2006, p. 27).
Content domain: The set of behaviors, knowledge, skills, abilities, attitudes, or other
characteristics to be measured by a test, represented in detailed test specifications and often organized into categories by which items are classified (p. 218).
Construct: The concept or characteristic that a test is designed to measure, such as
mathematics or reading achievement (p. 217).
Cut score: A specific point on a score scale, such that scores at or above that point are
reported, interpreted, or acted upon differently from scores below that point (p. 218).
Framework: A detailed description of the construct to be assessed, delineating the scope
of the knowledge, skills, and abilities that are represented. The framework includes the blueprint for the test that is used to guide the development of the questions and tasks, determine response formats (multiple-choice, constructed response), and scoring procedures (p. 11).
Performance standard: Statements of what test takers at different performance levels
know and can do, along with cut scores or ranges of scores on the scale of an assessment that differentiate levels of performance. A performance standard consists of a cut score and a descriptive statement (p. 221).
Standard setting: The process, often judgment based, of setting cut scores using a structured procedure that seeks to map test scores into discrete performance levels that are usually specified by performance-level descriptors (p. 224).
SOURCE: Adapted from American Educational Research Association et al. (2014).
BOX 1-2 Evaluation Questions
1. Why was achievement-level reporting implemented? What was it intended to
accomplish? How are achievement levels intended to be interpreted and used? What inferences are they intended to support? What comparisons are appropriate?
2. What validity evidence exists that demonstrates these interpretations and uses are supportable?
a. What evidence exists to document that the achievement-level-setting process was handled in a way likely to produce results that support the intended
interpretations and uses?
b. What evidence exists to document that the achievement levels represent the content and skills they were intended to represent (content- or construct-based validity)? Was the process handled in a way likely to produce consensus among the panelists? To what extent is there congruence between the
frameworks, test questions, achievement-level descriptors, and cut scores?
c. What evidence exists to document that the cut scores are appropriate, neither too high nor too low? To what extent do relationships with other variables suggest that the cut scores are appropriate and support the intended interpretations and uses (criterion-related validity)?
3. Was the overall process for determining achievement levels—their descriptions, the designated levels (basic, proficient, advanced), and cut scores—reasonable and
sensible? Did it follow generally accepted procedures (at the time the achievement levels were set and also according to the current state of the field and knowledge base)?
4. Did the process yield a reasonable set of cut scores? Do the score distributions (the percentage at each achievement level) seem reasonable, given what is known about student achievement from other sources? Are the higher levels (proficient, advanced) attainable?
5. What questions do stakeholders want and need NAEP to answer? Do achievement-level reports respond to these wants and needs? Do achievement-level reports respond to these wants and needs better than reports on other metrics (e.g., summaries of scale scores)?
6. What are common interpretations and uses? What are common misinterpretations and misuses? What guidance is provided to help users interpret achievement-level results?