Evaluation of the Achievement Levels
for Mathematics and Reading
on the National Assessment
of Educational Progress

Christopher Edley, Jr., and Judith A. Koenig, Editors

Committee on the Evaluation of NAEP Achievement Levels for Mathematics and Reading

Board on Testing and Assessment
Committee on National Statistics

Division of Behavioral and Social Sciences and Education
This activity was supported by Contract No. ED-IES-14-C-0124 from the U.S. Department of Education. Any opinions, findings, conclusions, or recommendations expressed in this publication do not necessarily reflect the views of any organization or agency that provided support for the project.
International Standard Book Number-13: 978-0-309-43817-9
International Standard Book Number-10: 0-309-43817-9
Digital Object Identifier: 10.17226/23409
Additional copies of this report are available for sale from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu
Copyright 2016 by the National Academy of Sciences. All rights reserved.
Printed in the United States of America
Suggested citation: National Academies of Sciences, Engineering, and Medicine (2016). Evaluation of the Achievement Levels for Mathematics and Reading on the National Assessment of Educational Progress. Washington, DC: The National Academies Press.
doi: 10.17226/23409
The National Academy of Sciences was established in 1863 by an Act of Congress, signed by President Lincoln, as a private, nongovernmental institution to advise the nation on issues related to science and technology. Members are elected by their peers for outstanding contributions to research.

The National Academy of Engineering was established in 1964 under the charter of the National Academy of Sciences to bring the practices of engineering to advising the nation. Members are elected by their peers for extraordinary contributions to engineering. Dr. C. D. Mote, Jr., is president.

The National Academy of Medicine (formerly the Institute of Medicine) was established in 1970 under the charter of the National Academy of Sciences to advise the nation on medical and health issues. Members are elected by their peers for distinguished contributions to medicine and health. Dr. Victor J. Dzau is president.

The three Academies work together as the National Academies of Sciences, Engineering, and Medicine to provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions. The Academies also encourage education and research, recognize outstanding contributions to knowledge, and increase public understanding in matters of science, engineering, and medicine.

Learn more about the National Academies of Sciences, Engineering, and Medicine at www.nationalacademies.org.
Reports document the evidence-based consensus of an authoring committee of experts. Reports typically include findings, conclusions, and recommendations based on information gathered by the committee and committee deliberations. Reports are peer reviewed and are approved by the National Academies of Sciences, Engineering, and Medicine.

Proceedings chronicle the presentations and discussions at a workshop, symposium, or other convening event. The statements and opinions contained in proceedings are those of the participants and have not been endorsed by other participants, the planning committee, or the National Academies of Sciences, Engineering, and Medicine.

For information about other products and activities of the Academies, please visit nationalacademies.org/whatwedo.
COMMITTEE ON THE EVALUATION OF NAEP ACHIEVEMENT LEVELS
CHRISTOPHER EDLEY, JR. (Chair), School of Law, University of California, Berkeley
PETER AFFLERBACH, Department of Teaching and Learning, Policy and Leadership, University of Maryland
SYBILLA BECKMANN, Department of Mathematics, University of Georgia
H RUSSELL BERNARD, Institute for Social Science Research, Arizona State
University, and Department of Anthropology, University of Florida
KARLA EGAN, National Center for the Improvement of Educational Assessment,
Dover, NH
DAVID J FRANCIS, Department of Psychology, University of Houston
MARGARET E GOERTZ, Graduate School of Education, University of Pennsylvania
(emerita)
LAURA HAMILTON, RAND Education, RAND Corporation, Pittsburgh, PA
BRIAN JUNKER, Department of Statistics, Carnegie Mellon University
SUZANNE LANE, School of Education, University of Pittsburgh
SHARON J LEWIS, Council of the Great City Schools, Washington, DC (retired)
BERNARD L MADISON, Department of Mathematics, University of Arkansas
SCOTT NORTON, Standards, Assessment, and Accountability, Council of Chief State
School Officers, Washington, DC
SHARON VAUGHN, College of Education, University of Texas at Austin
LAURESS WISE, Human Resources Research Organization, Monterey, CA
JUDITH KOENIG, Study Director
JORDYN WHITE, Program Officer
NATALIE NIELSEN, Acting Board Director (until June 2015)
PATTY MORISON, Acting Board Director (from June 2015)
KELLY ARRINGTON, Senior Program Assistant
BOARD ON TESTING AND ASSESSMENT
DAVID J FRANCIS (Chair), Texas Institute for Measurement, Evaluation, and
Statistics, University of Houston
MARK DYNARSKI, Pemberton Research, LLC, East Windsor, NJ
JOAN HERMAN, National Center for Research on Evaluation, Standards, and Student
Testing, University of California, Los Angeles
SHARON LEWIS, Council of Great City Schools, Washington, DC
BRIAN STECHER, Education Program, RAND Corporation, Santa Monica, CA
JOHN ROBERT WARREN, Department of Sociology, University of Minnesota
NATALIE NIELSEN, Acting Director (until June 2015)
PATTY MORISON, Acting Director (from June 2015)
COMMITTEE ON NATIONAL STATISTICS
LAWRENCE D BROWN (Chair), Department of Statistics, Wharton School,
University of Pennsylvania
JOHN M ABOWD, School of Industrial and Labor Relations, Cornell University
FRANCINE BLAU, Department of Economics, Cornell University
MARY ELLEN BOCK, Department of Statistics, Purdue University
MICHAEL E CHERNEW, Department of Health Care Policy, Harvard Medical
School
DON A DILLMAN, Department of Sociology, Washington State University
CONSTANTINE GATSONIS, Center for Statistical Sciences, Brown University
JAMES S HOUSE, Survey Research Center, Institute for Social Research, University
of Michigan
MICHAEL HOUT, Survey Research Center, University of California, Berkeley
THOMAS L MESENBOURG, U.S Census Bureau (retired)
SUSAN A MURPHY, Department of Statistics, University of Michigan
SARAH M NUSSER, Department of Statistics, Center for Survey Statistics and
Methodology, Iowa State University
COLM A O’MUIRCHEARTAIGH, Harris Graduate School of Public Policy Studies,
University of Chicago
RUTH D PETERSON, Criminal Justice Research Center, Ohio State University
ROBERTO RIGOBON, Sloan School of Management, Massachusetts Institute of
Technology
EDWARD H SHORTLIFFE, Biomedical Informatics, Columbia University and
Arizona State University
CONSTANCE F CITRO, Director
BRIAN HARRIS-KOJETIN, Deputy Director
achievement goals for U.S. students. In response, NAEP adopted standards-based reporting for the subjects and grades it assessed.

Today, some 24 years later, the focus on achievement standards is even more intense. At its best, standards-based reporting can provide a quick way to summarize students' achievement and track their progress. It can clearly demarcate the disparities between what we expect our students to know and be able to do, as articulated in the standards, and what they actually know and can do. It can stimulate policy conversations about educational achievement across the country, identifying areas and groups with high performance as well as those with troubling disparities. It can inform policy interventions and reform measures to improve student learning.

There are potential downsides, however. Standards-based reporting can lead to erroneous interpretations. It can overstate or understate progress, particularly when the goal is to monitor the performance of subgroups. In its attempt to be easily understood by all audiences, it can lead to oversimplifications, such as when users do not do the necessary work to ensure their understandings are correct. At a time when policy makers are vitally interested in ensuring that all of the country's students achieve high standards, it is critical that test results are reported in a way that leads to accurate and valid interpretations and that the standards used in reporting deserve the confidence of policymakers, practitioners, and the public.

The U.S. Department of Education turned to the National Academies of Sciences, Engineering, and Medicine to answer these questions, and the Academies set up the Committee on the Evaluation of NAEP Achievement Levels for Mathematics and Reading. The purpose of the project was to evaluate the extent to which NAEP's standards (or achievement levels) are reliable and valid. Are they reasonable? Are they informative to the public? Do they lead to appropriate interpretations? The committee of 15 brought a broad range of education and assessment experience. Our consensus findings, conclusions, and recommendations are documented in this report.

The committee benefited from the work of many others, and we wish to thank the many individuals who assisted us. We first thank the sponsor who supported this work, the U.S. Department of Education, and the staff with the department's Institute of Education Sciences who oversaw our work, Jonathon Jacobson and Audrey Pendleton.

During the course of its work, the committee met three times in person and four times by video conference. The committee's first meeting included a public session
designed to learn more about the sponsor's goals for the project and the history of achievement level setting for the National Assessment of Educational Progress (NAEP). We heard from officials from the National Assessment Governing Board (NAGB) and the National Center for Education Statistics (NCES). We especially thank Cornelia Orr, former Executive Director, and Sharyn Rosenberg, Assistant Director for Psychometrics, with NAGB, and Peggy Carr, Acting Commissioner of NCES, for the vast amount of historical information they provided.

The committee's second meeting included a public forum designed to provide an opportunity for committee members to hear first-hand accounts of the variety of ways NAEP users interpret and use the achievement levels. These discussions were enormously enlightening to the committee, and we thank all who participated: Patte Barth, Sonja Brookins Santelises, Sarah Butrymowicz, Michael Casserly, Enis Dogan, Wendy Geiger, Catherine Gewertz, Renee Jackson, Scott Jenkins, Mike Kane, Jacqueline King, Lyndsey Layton, Nathan Olson, Emily Richmond, Bob Rothman, Lorrie Shepard, and Dara Zeehandlaar. The agenda in Appendix A shows their titles and affiliations.

The committee gratefully acknowledges the dedicated effort provided by the staff of the Board on Testing and Assessment (BOTA) and the Committee on National Statistics (CNSTAT) who worked directly on this project. We first thank Natalie Nielsen, former acting director of BOTA, for her extensive efforts to make this project a reality and her dedication in ensuring its success. We thank Patricia Morison, current acting director of BOTA, for her support and guidance at key stages in this project. We are grateful to Connie Citro, director of CNSTAT, and Robert Hauser, executive director of DBASSE, for their sage advice through the course of this project. We thank Kelly Arrington, senior program assistant, for her exceptional organizational skills and her close attention to detail. Kelly handled all of the administrative details associated with the in-person and virtual meetings and the public forum, and she provided critical support in preparing the manuscript. We are grateful to Jordyn White, program officer with CNSTAT, for her adept research skills and her expertise in assimilating data and designing understandable presentations. We especially thank Judy Koenig for her intellectual and organizational skills as the study director and for her work in assembling the critical information needed to prepare this report. Judy worked tirelessly to keep us on task and to ensure that we met the multiple challenges presented by our charge.

We also thank members of the Office of Reports and Communication of the Division of Behavioral and Social Sciences and Education for their dedicated work on this report. We are indebted to Eugenia Grohman for her sage editorial advice on numerous versions of this manuscript. We thank Kirsten Sampson Snyder for her work in coordinating a very intense review process and Yvonne Wise for shepherding the manuscript through production.

This report has been reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise, in accordance with procedures approved by the Report Review Committee of the National Academies of Sciences, Engineering, and Medicine. The purpose of this independent review is to provide candid and critical comments that will assist the institution in making its published report as sound as possible and to ensure that the report meets institutional standards for objectivity,
evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process.

We thank the following individuals for their review of this report: Lisa M. Abrams, Department of Foundations of Education, School of Education, Virginia Commonwealth University; Brian E. Clauser, Educational Assessment, National Board of Medical Examiners; Sarah W. Freedman, Graduate School of Education, University of California, Berkeley; Roger Howe, Department of Mathematics, Yale University; Nonie K. Lesaux, Graduate School of Education, Harvard University; Joseph A. Martineau, Senior Associate, Center for Assessment, Dover, NH; William Penuel, School of Education, University of Colorado Boulder; Barbara Plake, Buros Institute of Mental Measurements, Emeritus, University of Nebraska-Lincoln; Mark D. Reckase, Department of Counseling, Educational Psychology and Special Education, Michigan State University; Lorrie A. Shepard, School of Education, University of Colorado Boulder; and Dale Whittington, Research and Accountability, Shaker Heights City School District, Shaker Heights, OH.

Although the reviewers listed above provided many constructive comments and suggestions, they were not asked to endorse the conclusions or recommendations, nor did they see the final draft of the report before its release. The review of this report was overseen by Edward H. Haertel, School of Education, Stanford University, and Ronald S. Brookmeyer, Department of Biostatistics, University of California, Los Angeles. Appointed by the National Research Council, they were responsible for making certain that an independent examination of this report was carried out in accordance with institutional procedures and that all review comments were carefully considered. Responsibility for the final content of this report rests entirely with the authoring committee and the institution.

As chair of the committee, I thank my fellow panel members for their extraordinary efforts. They brought a broad range of expertise related to assessment, education policy, equity in education, mathematics and reading, program evaluation, social science, and statistics, all of which was critical to the multifaceted issues that had to be addressed in this evaluation. The panel members freely contributed their time to accomplishing the myriad tasks associated with assembling information and preparing this report. They actively assisted in all stages of this project, including planning the meetings and the public forum, as well as writing and rewriting multiple versions of this report. They gave generously of their time to ensure that the final product accurately represents our consensus findings, conclusions, and recommendations. Their contributions during the period in which the report was in final preparation and after the external review, when sections of the report had to be turned around on a very truncated schedule, are especially appreciated. These efforts manifested the panel members' deep dedication to improving student learning across the country.
Christopher Edley, Jr., Chair
Committee on the Evaluation of NAEP Achievement Levels for Mathematics and Reading
Dedication
During the course of this project, the measurement community lost a towering figure. Robert Linn, distinguished professor emeritus of education at the University of Colorado Boulder, worked at the intersection of education measurement and education policy. He was known for his enormous contributions to test equating, fairness, validity, performance assessment, and educational accountability, and he was widely respected for his ability to translate theory into practical advice. His contributions to NAEP spanned several decades, and he continued to serve on the NAEP Validity Studies Panel until his final months. His contributions to the NAS were numerous: he served as a board member and chair of BOTA and as a readily sought-after committee member and reviewer for many projects. A prodigious scholar and remarkable intellectual leader, Bob was also unceasingly patient and supportive as a teacher, colleague, and friend. We are indebted to him for his contributions and will miss him. We dedicate this report to his memory.
Contents

Evolution of NAEP Achievement Levels
Evolution of Standard Setting
Evaluations of NAEP's Standard Settings
Setting Cut Scores
Evolution of Achievement-Level Descriptors
Summary

3 Setting Achievement Levels: NAEP Process

5 Validity of the Achievement Levels
Concepts of Validity and Validation
Content-Related Validity Evidence

6 Interpretations and Uses of NAEP Achievement Levels
Relevant Standards and Committee Sources
Intended Users, Interpretations, and Uses
NAEP Guidance for Users
Actual Uses of Achievement Levels
Potential Misinterpretations and Misuses of Achievement Levels
Conclusions
Annex: Bibliography

7 Setting New Standards: Considerations
Options for Handling Change
Mathematics
Reading
Digitally Based Assessment
Weighing the Options

8 Conclusions and Recommendations

A Agenda for Public Forum
B Biographical Sketches of Committee Members and Staff
Summary
The National Assessment of Educational Progress (NAEP) has been providing policy makers, educators, and the public with reports on the academic performance and progress of the nation's students since 1969. The assessment is given periodically in a variety of subjects: mathematics, reading, writing, science, the arts, civics, economics, geography, U.S. history, and technology and engineering literacy. NAEP is often referred to as the nation's report card because it reports on the educational progress of the nation as a whole. The assessment is not given to all students in the country, and scores are not reported for individual students. Instead, the assessment is given to representative samples of students across the United States, and results are reported for the nation and for specific groups of students.

Since 1983, the results have been reported as average scores on a scale ranging from 0 to 500. Until 1992, results were reported on this scale for the nation as a whole and for students grouped by sociodemographic characteristics, such as gender, race and ethnicity, and socioeconomic status. Beginning in 1993, results were reported separately by state and, beginning in 2002, also for some urban school districts.

Over time, there has been growing interest in comparing educational progress across the states. At the same time, there has been increasing interest in having the results reported in a way that policy makers and the public could understand and that could be used to examine students' achievement in relation to high, world-class standards. By 1989, there was considerable support for changes in the way that NAEP results were reported.

In part in response to these interests, the idea of reporting NAEP results using achievement levels was first raised in the late 1980s. The Elementary and Secondary Education Act of 1988, which authorized the formation of the National Assessment Governing Board (NAGB), delegated to NAGB the responsibility of "identifying appropriate achievement goals" (P.L. 100-297, Part C, Section 3403 (6) (A)). The decision to report NAEP results in terms of achievement levels was based on NAGB's interpretation of this legislation.

In a 1990 policy statement, NAGB established three "achievement levels": basic, proficient, and advanced. The NAEP results would henceforth report the percentage of test takers by achievement level. The percentage of test takers who scored below the basic level would also be reported. These new reports would be in addition to summary statistics on the score scale.
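To make the mechanics of this kind of reporting concrete, the sketch below classifies a synthetic distribution of 0-500 scale scores into the four reporting categories and summarizes the percentage of test takers in each. It is an illustration only: the cut-score values and the score distribution are assumptions, not NAEP data or official cut scores.

```python
# Minimal sketch of achievement-level reporting (illustrative only; the cut
# scores and score distribution below are assumptions, not NAEP values).
import numpy as np

CUTS = {"basic": 214, "proficient": 249, "advanced": 282}  # hypothetical cut scores

def classify(score: float) -> str:
    """Map a 0-500 scale score to one of the four reporting categories."""
    if score >= CUTS["advanced"]:
        return "advanced"
    if score >= CUTS["proficient"]:
        return "proficient"
    if score >= CUTS["basic"]:
        return "basic"
    return "below basic"

rng = np.random.default_rng(0)
scores = rng.normal(240, 35, size=10_000).clip(0, 500)  # synthetic scale scores

labels = np.array([classify(s) for s in scores])
for level in ["below basic", "basic", "proficient", "advanced"]:
    print(f"{level:>12}: {100 * np.mean(labels == level):5.1f}%")
# Reports pair these percentages with summary statistics on the score scale.
```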
After a major standard setting process in 1992, NAEP began reporting results in relation to the three achievement levels. However, the use of achievement levels has provoked controversy and disagreement, and evaluators have identified numerous concerns. When NAEP was reauthorized in 1994, Congress stipulated that until an evaluation determined that the achievement levels are reasonable, reliable, valid, and informative to the public, they were to be designated as trial, a provisional status that still remains, 22 years later.

In 2014 the U.S. Department of Education, through its Institute of Education Sciences, sought to reexamine the need for this provisional status and contracted with the National Academy of Sciences to appoint a committee of experts to carry out that examination. The committee's charge was to determine whether the NAEP achievement levels in reading and mathematics are reasonable, reliable, valid, and informative to the public. More specifically, it was to evaluate the student achievement levels used in reporting NAEP results, the procedures for setting those levels, and how they are used (see Chapter 1 for the complete charge).
In addressing its charge, the committee focused on process, outcomes, and uses. That is, we evaluated (1) the process for conducting the standard setting; (2) the technical properties of the outcomes of standard setting (the cut scores and the achievement-level descriptors); and (3) the interpretations and uses of the achievement levels.
SETTING STANDARDS
In developing achievement levels, NAGB first needed to set standards, a process that involves determining "how good is good enough" in relation to one or more criterion measures. For instance, in education one commonly used criterion is how good is good enough to attain an A; in employment settings, it would be the minimum test score needed to become certified or licensed to practice in a given field (whether plumbing or medicine).

To set achievement standards for NAEP, two questions had to be answered: What skills and knowledge do students need in order to be considered basic, proficient, and advanced in each subject area? What scores on the test indicate that a student has attained one of those achievement levels?

All standard setting is based on judgment. For a course grade, it is the judgment of the classroom teacher. For a licensure or certification test, it is the judgment of professionals in the field. For NAEP, it is more complicated. As a measure of achievement for a cross-section of U.S. students, NAEP's achievement levels need to reflect common goals for student learning, despite the fact that students are taught according to curricula that vary across states and districts. To accommodate these differences, NAEP's standards need to reflect a wide spectrum of judgments. Hence, NAGB sought feedback from a wide range of experts and stakeholders in setting the standards: educators, administrators, subject-matter specialists, policy makers, parent groups, and professional organizations, as well as the general public.

Through the standard setting process, NAGB adopted a set of achievement levels for each subject area and grade. The achievement levels include a description of the knowledge and skills necessary to perform at a basic, proficient, and advanced level, as well as the "cut score," the minimum score needed to attain each achievement level.
FINDINGS AND CONCLUSIONS
The Process
In setting standards for the 1992 reading and mathematics assessments, NAGB broke new ground. While standard setting has a long history, its use in the area of educational achievement testing, and its use to set multiple standards for a given assessment, was new. While the Standards for Educational and Psychological Testing in place at the time provided guidance for some aspects of the 1992 standard setting, many of the procedures used were novel and untested in the context of achievement testing for kindergarten through 12th grade (K-12).

In carrying out the process, NAGB sought advice and assistance from many measurement and subject-matter experts, including staff of the standard setting contractor, an advisory group of individuals with extensive standard setting expertise, and NAGB's own advisers. In addition, a panel of members of the National Academy of Education (NAEd) evaluated the work being done.

The NAEd evaluators raised questions about the integrity and validity of the process. Perhaps most importantly, they criticized the achievement-level descriptors, arguing that they were not valid representations of performance at the specified levels. They also criticized the specific method for setting the cut scores, arguing that it was too cognitively complex, thus limiting the validity of the outcome.

In spite of the NAEd evaluators' concerns, NAGB moved forward with achievement-level reporting for the 1992 assessments of mathematics and reading. Since then, NAGB and NCES have sponsored research conferences, sought advice from experts in standard setting, commissioned research, formed standing advisory groups, held training workshops, and published materials on standard setting.
For its review, the committee considered the Standards for Educational and Psychological Testing and guidance available in 1992, along with what is known now. In examining the process, we considered the ways in which panelists were selected and trained, how the method for setting the cut scores was selected and implemented, and how the achievement-level descriptors were developed. Our key findings:

The process for selecting standard-setting panelists was extensive and, in our judgment, likely to have produced a set of panelists that represented a wide array of views and perspectives.

In selecting a cut-score setting method, NAGB and ACT chose one method for the multiple-choice and short-answer questions and another for the extended-response questions. This was novel at the time and is now widely recognized as a best practice.

NAEP's 1992 standard setting represented the first time that formal, written achievement-level descriptions were developed to guide the standard setting panelists. This, too, was novel at the time and is now widely recognized as a best practice.
CONCLUSION 3-1 The procedures used by the National Assessment Governing Board for setting the achievement levels in 1992 are well documented. The documentation includes the kinds of evidence called for in the Standards for Educational and Psychological Testing in place at the time and currently and was in line with the research and knowledge base at the time.
Outcomes
The standard-setting process used for NAEP began with the frameworks (the blueprints for the mathematics and reading assessments), a general policy description of what each level is intended to represent (e.g., mastery over challenging subject matter), and a set of items that have been used to assess the knowledge and skills elaborated in the assessment frameworks. The standard setting process produces two key outcomes. The first outcome is a set of detailed achievement-level descriptors, specifying the knowledge and skills required at each of the achievement levels. The second outcome is the cut score that indicates the minimum scale score value for each achievement level. The achievement levels defined by these cut scores provide the basis for using and interpreting test results, and thus the validity of test score interpretations hinges on the appropriateness of the cut scores.

In evaluating these outcomes, the committee examined evidence of their reliability and validity. In the context of standard setting, reliability is a measure of the consistency, generalizability, and stability of the judgments (i.e., the cut scores). Reliability estimates indicate the extent to which the cut-score judgments are likely to be consistent across replications of the standard setting, such as repeating the standard setting with different panelists, different test questions, on different occasions, or with different methods.
NAGB conducted studies to collect three kinds of reliability evidence: interpanelist agreement; intrapanelist consistency across items of different types; and the stability of cut scores across occasions. The actual values of the estimates of consistency suggest a considerable amount of variability in cut-score judgments. The available documentation notes that this issue received considerable attention, but the sources and the effects of the variability were not fully addressed before achievement-level results were reported. We are hesitant to make judgments about the rationale for decisions made long ago; at the same time, we acknowledge that some of these issues warranted further investigation.
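As a minimal illustration of the kind of consistency indicators described above (not the analyses actually conducted for the 1992 standard settings), the sketch below computes the spread of hypothetical panelists' cut-score judgments and a jackknife standard error of the panel's mean cut score.

```python
# Minimal sketch of cut-score consistency indicators; the panelist judgments
# below are made-up values used only to show the computations.
import numpy as np

judgments = np.array([243., 251., 238., 255., 247., 242., 260., 249., 245., 252.,
                      239., 257., 244., 250., 246., 241., 254., 248., 243., 256.])

panel_cut = judgments.mean()                 # panel-recommended cut score
between_panelist_sd = judgments.std(ddof=1)  # interpanelist variability

# Jackknife standard error of the mean cut score across leave-one-out replications
n = len(judgments)
loo_means = np.array([np.delete(judgments, i).mean() for i in range(n)])
jackknife_se = np.sqrt((n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2))

print(f"panel cut score:      {panel_cut:.1f}")
print(f"between-panelist SD:  {between_panelist_sd:.1f}")
print(f"jackknife SE of mean: {jackknife_se:.2f}")
```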
CONCLUSION 4-1 The available documentation of the 1992 standard settings in reading and mathematics includes the types of reliability analyses called for in the Standards for Educational and Psychological Testing that were in place at the time and those that are currently in place. The evidence that resulted from these analyses, however, showed considerable variability among panelists' cut-score judgments: the expected pattern of decreasing variability among panelists across the rounds was not consistently achieved, and panelists' cut-score estimates were not consistent over different item formats and different levels of item difficulty. These issues were not resolved before achievement-level results were released to the public.
Validation in the context of standard setting usually consists of demonstrating that the proposed cut score for each achievement level corresponds to the achievement-level descriptor and that the achievement levels are set at a reasonable level, neither too low nor too high. Accordingly, studies were conducted to provide evidence of validity related to test content and to relationships with external criteria.

With regard to content-related validity evidence, the studies focused on the alignment among the achievement-level descriptors and cut scores, the frameworks, and the test questions. For these studies, a second and sometimes a third group of panelists was asked to review the achievement-level descriptors and cut scores produced by the initial standard setting. As a result of these reviews, changes were made to the achievement-level descriptors: some were suggested to NAGB by the panelists; others were made by NAGB.
Since 1992, changes have been made to the mathematics and reading frameworks, the assessment tasks, and the achievement level descriptors, most notably in 2005 and 2009. With the exception of grade-12 mathematics in 2005, no changes have been made in the cut scores. Moreover, the grade-12 descriptors for mathematics were changed in 2005 and 2009, but related changes were not made to those for grades 4 and 8. Consequently, the final descriptors were not the ones that panelists used to set the cut scores.
CONCLUSION 5-1 The studies conducted to assess content validity are in line with those called for in the Standards for Educational and Psychological Testing in place in 1992 and currently in 2016. The results of these studies suggested that changes in the achievement-level descriptors were needed, and those changes were subsequently made. The changes may have better aligned the descriptors to the framework and exemplar items, but, as a consequence, the final achievement-level descriptors were not the ones used to set the cut scores. Since 1992, there have been additional changes to the frameworks, the item pools, and the assessments, as well as studies to identify needed revisions to the achievement level descriptors. But, to date, there has been no effort to set new cut scores using the most current achievement level descriptors.1
CONCLUSION 5-2 Changes in the NAEP mathematics frameworks in 2005 led to new achievement level descriptors and a new scale and cut scores for the achievement levels at 12th grade, but not for the 8th and 4th grades. These changes create a perceived or actual break between 12th-grade mathematics and 8th- and 4th-grade mathematics. Such a break is at odds with contemporary thinking in mathematics education, which holds that school mathematics should be coherent across grades.

1 This text was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
Criterion-related validity evidence usually consists of comparisons with other indicators of the content and skills measured by an assessment, in this case, other measures of achievement in reading and mathematics. The goal is to help evaluate the extent to which achievement levels are reasonable and set at an appropriate level.

It can be challenging to identify and collect the kinds of data that are needed to provide evidence of criterion-related validity. The ACT reports that document the validity of the achievement levels do not include results from any studies that compared NAEP achievement levels with external measures. It is not clear why NAGB did not pursue such studies. In contrast, the NAEd reports include a variety of such studies, such as comparisons with state assessments, international assessments, Advanced Placement (AP) tests, and college admissions tests. The NAEd evaluators also conducted a special study in which 4th- and 8th-grade teachers classified their own students into the achievement-level categories by comparing the achievement-level descriptors with the students' classwork. We examined evidence from similar sources for our evaluation, and we consider it to be of high value in judging the reasonableness of the achievement levels.
Our comparisons reveal considerable correspondence between the percentages of students at NAEP achievement levels and the percentages on other assessments. These studies show that the NAEP achievement-level results (the percentage of students at the advanced level) are generally consistent with the percentage of U.S. students scoring at the reading and mathematics benchmarks on the Programme for International Student Assessment (PISA), at the mathematics benchmarks on the Trends in International Mathematics and Science Study (TIMSS), and at the higher levels on Advanced Placement exams. These studies also show that significant numbers of students in other countries score at the equivalent of the NAEP advanced level.
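The sketch below shows the form such a comparison takes: tabulating the percentage of students at or above a NAEP level next to the percentage at or above a benchmark on an external assessment. All of the entries are placeholder values for illustration, not reported results.

```python
# Minimal sketch of a criterion-related comparison table; the percentages are
# placeholders, not results from NAEP, TIMSS, PISA, or AP reports.
comparisons = [
    # (comparison, % at/above NAEP level, % at/above external benchmark)
    ("grade 8 math: NAEP advanced vs. TIMSS advanced benchmark", 8.0, 9.0),
    ("reading: NAEP advanced vs. PISA top proficiency levels",   7.0, 8.0),
]

print(f"{'comparison':<58}{'NAEP %':>8}{'ext. %':>8}{'diff':>7}")
for label, naep_pct, ext_pct in comparisons:
    print(f"{label:<58}{naep_pct:8.1f}{ext_pct:8.1f}{naep_pct - ext_pct:7.1f}")
```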
CONCLUSION 5-3 The Standards for Educational and Psychological Testing in place in 1992 did not explicitly call for criterion-related validity evidence for achievement level setting, but such evidence was routinely examined by testing programs. The National Assessment Governing Board did not report information on criterion-related evidence to evaluate the reasonableness of the cut scores set in 1992. The National Academy of Education evaluators reported four kinds of criterion-related validity evidence, and they concluded that the cut scores were set very high. We were not able to determine if this evidence was considered when the final cut scores were adopted for NAEP.
Recent research has focused on validity evidence based on relationships with external variables, that is, setting benchmarks on NAEP that are related to concurrent or future performance on measures external to NAEP. The findings from this research can be used to evaluate the validity of new interpretations of the existing achievement levels, suggest possible adjustments to the cut scores or descriptors, or enhance understanding and use of the achievement level results. This research can also help establish specific benchmarks that are separate from the existing achievement levels, such as college readiness.
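One common way to establish such a benchmark is an equipercentile-style linking: find the NAEP scale score whose percentile rank, in a linked sample, matches the percentile rank of the benchmark on the external measure. The sketch below illustrates the idea with synthetic data; it is not the methodology of any particular NAEP benchmarking study.

```python
# Minimal equipercentile-style linking sketch (synthetic data; illustrative only).
import numpy as np

rng = np.random.default_rng(1)

# Scores for a linked sample observed on both measures
naep = rng.normal(285, 30, size=8_000)                 # synthetic NAEP scale scores
external = 2.1 * naep + rng.normal(0, 40, size=8_000)  # synthetic, correlated external measure

external_benchmark = 650.0  # hypothetical "prepared" benchmark on the external measure

pct_below = 100 * np.mean(external < external_benchmark)  # percentile rank of the benchmark
naep_benchmark = np.percentile(naep, pct_below)           # NAEP score at the same percentile

print(f"{pct_below:.1f}% of the linked sample scores below the external benchmark")
print(f"equipercentile NAEP benchmark: {naep_benchmark:.0f}")
```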
CONCLUSION 5-4 Since the NAEP achievement levels were set, new research has investigated the relationships between NAEP scores and external measures, such as academic preparedness for college. The findings from this research can be used to evaluate the validity of new interpretations of the existing performance standards, suggest possible adjustments to the cut scores or descriptors, or enhance understanding and use of the achievement level results. This research can also help establish specific benchmarks that are separate from the existing achievement levels. This type of research is critical for adding meaning to the achievement levels.
Interpretation and Use
Originally, NAEP was designed to measure and report what U.S. students actually know and are able to do. However, the achievement levels were designed to lay out what U.S. students should know and be able to do. That is, the adoption of achievement levels added an extra layer of reporting to reflect the nation's aspirations for students. Reporting NAEP results as the percentage of students who scored at each achievement level was intended to make NAEP results more understandable. This type of reporting was designed to clearly and succinctly highlight the extent to which U.S. students are meeting expectations.

The committee was unable to find any official documents that provide guidance on the intended interpretations and uses of NAEP achievement levels, beyond the brief statements in two policy documents. The committee was also unable to find documents that specifically lay out appropriate uses and the associated research to support those uses. We found a disconnect between the kind of validity evidence that has been collected and the kinds of interpretations and uses that are made of NAEP's reported results. That is, although the committee found evidence for the integrity and accuracy of the procedures used to set the achievement levels, the evidence does not extend to the uses of the achievement levels: the ways that NAEP audiences use the results and the decisions they base on them.

The committee found that considerable information is provided to state and district personnel and the media in preparation for a release of NAEP results, and NAGB provided us with examples of these materials. However, this type of information was not easy to find on the NAEP website.

The many audiences for NAEP achievement levels use them in a variety of ways, including to inform public discourse and policy decisions, as was the original intention. However, the interpretive guidance provided to users is inconsistent and fragmented. Some audiences receive considerable guidance just prior to a release of results. For audiences that obtain most of their information from the website or from hard-copy reports for the general public, interpretive guidance is hard to locate. Without appropriate guidance, misuses are likely. The committee found numerous types of inappropriate inferences.
CONCLUSION 6-1 NAEP achievement levels are widely disseminated to and used by many audiences, but the interpretive guidance about the meaning and appropriate uses of those levels provided to users is inconsistent and piecemeal. Without appropriate guidance, misuses are likely.
CONCLUSION 6-2 Insufficient information is available about the intended interpretations and uses of the achievement levels and the validity evidence that supports these interpretations and uses. There is also insufficient information on the actual interpretations and uses commonly made by NAEP's various audiences and little evidence to evaluate the validity of any of them.

CONCLUSION 6-3 The current achievement-level descriptors may not provide users with enough information about what students at a given level know and can do. The descriptors do not clearly provide accurate and specific information about what students at the cut score for each level know and can do.

The committee recognizes that the achievement levels are a well-established part of NAEP, with wide influence on state K-12 achievement tests. Making changes to something that has been in place for over 24 years would likely have a range of consequences that cannot be anticipated. We also recognize the difficulties that might be created by setting new standards, particularly the disruptions that would result from breaking people's interpretations of the trends. We also note that during their 24 years the achievement levels have acquired meaning for NAEP's various audiences and stakeholders: they serve as stable benchmarks for monitoring achievement trends, and they are widely used to inform public discourse and policy decisions. Users regard them as a regular, permanent feature of NAEP reports.

To date, the descriptors for grade 12 mathematics and grades 4, 8, and 12 reading have been revised and updated as recently as 2009, but no changes have been made to the descriptors for mathematics in grades 4 and 8 since 2004.

We considered several courses of action, ranging from recommending no changes to recommending a new standard setting. We concluded that most of the strongest criticisms of the current standards, and the argument for completely new standards, can be addressed instead by revising the achievement-level descriptors.
CONCLUSION 7-1 The cut scores for grades 4 and 8 in mathematics and for all grades in reading were set more than 24 years ago. Since then, there have been many adjustments to the frameworks, item pools, assessments, and achievement level descriptors, but there has been no effort to set new cut scores for these assessments. While priority has been given to maintaining the trend lines, it is possible that there has been "drift" in the meaning of the cut scores such that the validity of inferences about trends is questionable. The situation for grade 12 mathematics is similar, although possibly to a lesser extent, since the cut scores were set more recently (in 2005) and, thus far, only one round of adjustments has been made (in 2009).2

CONCLUSION 7-2 Although there is evidence to support conducting a new standard setting at this time for all grades in reading and mathematics, setting new cut scores would disrupt the NAEP trend line at a time when many other contextual factors are changing. In the short term, the disruption in the trend line could be avoided by continuing to use the same cut scores while ensuring that the descriptors are aligned with them. In particular, work is needed to ensure that the mathematics achievement-level descriptors for grades 4 and 8 are well aligned with the framework, cut scores, and item pools. Additional work to evaluate the alignment of the items and the achievement level descriptors for grade 4 reading and grade 12 mathematics is also needed. This work should not be done piecemeal, one grade at a time; rather, it should be done in a way that maintains the continuum of skills and knowledge across grades.3

2 This conclusion was added after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
RECOMMENDATIONS
The panel's charge included the question of reasonableness. NAEP and its achievement levels loom large in public understanding of critical debates about education, excellence, and opportunity. One can fairly argue that the "nation's report card" is a success for that reason alone. Through 25 years of use, the NAEP achievement levels have acquired a "use validity," or reasonableness, by virtue of familiarity.

In the long term, we recommend a thorough revision of the achievement-level descriptions, informed by a suite of education, social, and economic outcomes important to key audiences. We envision a set of descriptions that correspond to a few salient outcomes, such as college readiness or international comparisons. The studies we recommend, however, would also offer ways to characterize other scale score points. This information should be available to the public along with test item exemplars. The more audiences understand the scale scores, the less likely they are to misuse the achievement levels.

Setting new cut scores at this time, when so many things are in flux, would likely create considerable confusion about their meaning. We do not encourage a new standard setting at this time, but we note that at some point the balance of concerns will tip in favor of a new standard setting. There will be evolution in the methodology, the assessment frameworks, the technology of test administration and hence the nature of items, and more. We suggest that the U.S. Department of Education state an intention to revisit this issue in a stated number of years. We offer specific recommendations below.
RECOMMENDATION 1 Alignment among the frameworks, the item pools, the achievement-level descriptors, and the cut scores is fundamental to the validity of inferences about student achievement. In 2009, alignment was evaluated for all grades in reading and for grade 12 in mathematics, and changes were made to the achievement-level descriptors as needed. Similar research is needed to evaluate alignment for the grade 4 and grade 8 mathematics assessments and to revise the descriptors as needed to ensure that they represent the knowledge and skills of students at each achievement level. Moreover, additional work to verify alignment for grade 4 reading and grade 12 mathematics is needed.4

3 This conclusion was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
RECOMMENDATION 2 Once satisfactory alignment among the frameworks, the item pools, the achievement-level descriptors, and the cut scores in NAEP mathematics and reading has been demonstrated, the achievement levels' designation as trial should be discontinued. This work should be completed and the results evaluated as stipulated by law5 (20 U.S. Code 9622: National Assessment of Educational Progress, https://www.law.cornell.edu/uscode/text/20/9622 [September 2016]).

RECOMMENDATION 3 To maintain the validity and usefulness of the achievement levels, there should be regular, recurring reviews of the achievement-level descriptors, with updates as needed, to ensure that they reflect both the frameworks and the incorporation of those frameworks in NAEP assessments.6

RECOMMENDATION 4 Research is needed on the relationships between the NAEP achievement levels and concurrent or future performance on measures external to NAEP. Like the research that led to setting scale scores that represent academic preparedness for college, new research should focus on other measures of future performance, such as being on track for a college-ready high school diploma for 8th-grade students and readiness for middle school for 4th-grade students.7

RECOMMENDATION 5 Research is needed to articulate the intended interpretations and uses of the achievement levels and to collect validity evidence to support these interpretations and uses. In addition, research is needed to identify the actual interpretations and uses commonly made by NAEP's various audiences and to evaluate the validity of each of them. This information should be communicated to users with clear guidance on substantiated and unsubstantiated interpretations.8
RECOMMENDATION 6 Guidance is needed to help users determine inferences that are best made with achievement levels and those best made with scale score statistics. Such guidance should be incorporated in every report that includes achievement levels.9

4 Recommendation 1 was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
5 This recommendation was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
6 This recommendation was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
7 This recommendation was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
8 This recommendation was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
A number of aspects of the NAEP reading and mathematics assessments have changed since 1992: the constructs and frameworks; the types of items, including more constructed-response questions; the ways of reporting results; and the addition of innovative web-based data tools. NAEP data have also been used in new ways over the past 24 years, such as reporting results for urban districts, including NAEP in federal accountability provisions, and setting academic preparedness scores. New linking studies have made it possible to interpret NAEP results in terms of the results of international assessments, and there are possibilities for linking NAEP 4th- and 8th-grade results to indicators of being on track for future learning. Although external to NAEP, major national initiatives have significantly altered state standards in reading and mathematics.

These and other factors imply a changing context for NAEP. Staying current with contemporary practices and issues while also maintaining the trend line for NAEP results are competing goals.
RECOMMENDATION 7 NAEP should implement a regular cycle for considering the desirability of conducting a new standard setting. Factors to consider include, but are not limited to: substantive changes in the constructs, item types, or frameworks; innovations in the modality for administering assessments; advances in standard setting methodologies; and changes in the policy environment for using NAEP results. These factors should be weighed against the downsides of interrupting the trend data and information.10

9 This recommendation was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
10 This recommendation was revised after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 ("Data Sources").
1 Introduction
BACKGROUND
Since 1969 the National Assessment of Educational Progress (NAEP) has been providing policy makers, educators, and the public with reports on the academic performance and progress of the nation's students. The assessment is given periodically in a variety of subjects: mathematics, reading, writing, science, the arts, civics, economics, geography, U.S. history, and technology and engineering literacy. NAEP is often referred to as the Nation's Report Card because it reports on the educational progress of the nation as a whole. The assessment is not given to all students in the country, and scores are not reported for individual students. Instead, the assessment is given to representative samples of students across the country, and scores are reported only for groups of students. During the first decade of NAEP, results were reported for the nation as a whole and for students grouped by social and demographic characteristics, including gender, race and ethnicity, and socioeconomic status.

Over time, there was a growing desire to compare educational progress across the states. At the same time, there was an increasing desire to report the results in ways that policy makers and the public could better understand and to examine students' achievement in relation to world-class standards. By 1988, there was considerable support for these changes.

The Elementary and Secondary Education Act of 1988 authorized the formation of the National Assessment Governing Board (NAGB) and gave it responsibility for setting policy for NAEP and "identifying appropriate achievement goals" (Public Law 100-297, Part C, Section 3403 (6) (A)). In part in response to this legislation, NAGB decided to report NAEP results in terms of achievement levels, and that reporting began with the 1992 assessments.

Three achievement levels are used: "basic," "proficient," and "advanced." Results are reported according to the percentage of test takers whose scores are at each achievement level, and brief descriptions of the levels are provided with the results.1 The percentage of test takers who score below the basic level also is reported; Figures 1-2 through 1-7 in the annex to this chapter show an example of this type of reporting.
The purpose of setting achievement levels for NAEP was explicitly stated (National Assessment Governing Board, 1993, p. 1):

[to make it] unmistakably clear to all readers and users of NAEP data that these are expectations which stipulate what students should know and should be able to do at each grade level and in each content area measured by NAEP. The achievement levels make the NAEP data more understandable to general users, parents, policymakers and educators alike. They are an effort to make NAEP part of the vocabulary of the general public.

1 We use the words "description" and "descriptor" interchangeably in discussing achievement levels, as is done in the field.
In the ensuing two-and-a-half decades, the use of achievement levels has become a fixture of the reporting of test results, not only for NAEP, but also for a number of other testing programs. Notably, the No Child Left Behind Act of 2001 required that all states set achievement levels for their state tests, and many used the same names for achievement levels as those used by NAEP.

Given the potential value to the nation of setting achievement levels for NAEP and the importance of "getting it right," the procedures and results have received considerable scrutiny. Before achievement-level results were reported for the 1992 mathematics and reading assessments, numerous evaluations were conducted, including Stufflebeam et al. (1991), Linn et al. (1992a, 1992b), and Koretz and Deibert (1993), which focused on the 1990 mathematics results; the U.S. General Accounting Office (1993), which focused on the 1992 mathematics results; and Shepard et al. (1993), which covered the 1992 mathematics and reading results.

These reviews raised numerous concerns about the ways that NAEP's achievement levels had been developed and the extent to which they would support the intended interpretations. The reviews generally concluded that (1) the achievement levels were not necessarily valid indicators of educators' judgments of what students should know and be able to do and (2) the achievement-level descriptors were not accurate representations of the NAEP assessments and frameworks.

When NAEP was up for reauthorization in 1994, Congress stipulated that the achievement levels be used only on a "trial basis until the Commissioner [of the National Center for Education Statistics] determines, as a result of an evaluation, that such levels are reasonable, valid, and informative to the public" (P.L. 107-110 (2002), Sec. 602(e)). Since that time, achievement-level reports have carried a footnote indicating that they are trial, a provisional status that has remained for 25 years.

In 2014 the U.S. Department of Education sought to reexamine the need for the provisional status of the NAEP achievement levels, and it requested a study under the auspices of the National Academies of Sciences, Engineering, and Medicine to conduct this examination. The work was carried out under two standing activities in the Division of Behavioral and Social Sciences and Education: the Board on Testing and Assessment and the Committee on National Statistics. Together, the two boards established the Committee on the Evaluation of NAEP Achievement Levels in Reading and Math. The 15 committee members, who served as volunteers, had a broad range of expertise related to assessment, education policy, mathematics and reading, program evaluation, social science, and statistics.
The following statement of task guided the committee’s work:
An ad hoc committee will conduct an evaluation of the student achievement levels that are used in reporting results of the NAEP assessments in reading and
mathematics in grades 4, 8, and 12 to determine whether the achievement levels are reasonable, reliable, valid, and informative to the public. The committee will review the achievement-level-setting procedures used by the National Assessment Governing Board and the National Center for Education Statistics, the ways that the achievement levels are used in NAEP reporting, the interpretations made of them and the validity of those interpretations, and the research literatures related to setting achievement levels and reporting test results to the public. The committee will write a final report that describes its findings about the achievement levels and how the levels are used in NAEP reporting. If warranted, the committee's report will also provide recommendations about ways that the setting and use of achievement levels for NAEP can be improved.
To address its charge, the committee held three in-person meetings and four half-day virtual meetings during 2015. Before discussing our approach to the study, we provide some background on the process for developing achievement levels, or, more generally, standard setting, and on the key features of NAEP.
STANDARD SETTING
Translating NAEP scale scores into achievement levels involves a process
referred to as standard setting, in which different levels of performance are identified and described, and the test scores that distinguish between the levels are determined. Box 1-1 defines the terms often used with standard setting.
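To make the mechanics concrete, here is a minimal sketch of how cut scores partition a score scale into achievement levels. The cut-score values and level labels are hypothetical placeholders chosen for illustration, not NAEP's actual cut scores.

```python
# Illustrative only: hypothetical cut scores and level names, not NAEP's actual values.
from bisect import bisect_right

CUT_SCORES = [210, 250, 290]   # hypothetical lower bounds for Basic, Proficient, Advanced
LEVELS = ["Below Basic", "Basic", "Proficient", "Advanced"]

def classify(scale_score: float) -> str:
    """Return the achievement level whose score range contains scale_score."""
    return LEVELS[bisect_right(CUT_SCORES, scale_score)]

for score in (195, 230, 265, 300):
    print(score, "->", classify(score))
```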
Standard setting involves determining "how good is good enough" in relation to one or more achievement or performance measures. For instance, educators use it regularly to assign grades (e.g., how good is good enough for an A or a B). In employment settings, it has long been used to determine the minimum test score needed to become certified or licensed to practice in a given field: examples include medical board tests for doctors, licensing tests for nurses, bar examinations for lawyers, and certification examinations for accountants. In these education and employment examples,
a concrete criterion can be stated: What material should a student know and be able to do
to receive an A in a course? What should a prospective practitioner know and be able to
do to practice safely and effectively in a given field?
All standard setting is based on judgment. For a course grade, it is the judgment of the classroom teacher. For a licensure or certification test, it is the judgment of professionals who work in the field. For NAEP, it is more complicated. As a measure of achievement for a cross-section of U.S. students, NAEP's achievement levels need to reflect common goals for student learning—despite the fact that students are taught according to curricula that vary across states and districts. To accommodate this variation, NAEP's standards needed to reflect a wide spectrum of judgments about what achievement was intended by those who chose the curricula.
For large-scale tests like NAEP, standard setting is a formal process, with
guidelines for how it should be done. Generally, the process involves identifying a set of individuals with expertise in the relevant areas, recruiting them to serve as standard-setting panelists, training them to perform the standard-setting tasks, and implementing the procedures to obtain their judgments. There are two outcomes of the process: (1) a detailed description of what students at each achievement level should know and be able to do; and (2) the cut score that marks the minimum score needed to be placed at a given achievement level.
There are many methods for conducting standard settings. They differ in the nature of the tasks that panelists are asked to do and the types of judgments they are asked to make. There is no single "right" method; the choice of method is based on the kinds of question formats (e.g., multiple choice, short answer, extended response), various characteristics of the assessment and its uses, and often the experiential base of those conducting the standard setting. Regardless of the method chosen, the most critical issue is that the process is carried out carefully and systematically, following accepted procedures.
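As one concrete illustration of how panelist judgments can become a cut score, the sketch below works through the arithmetic of a simple Angoff-style approach, one widely used family of standard-setting methods; it is not presented as the exact procedure used for NAEP, and all ratings and item counts are hypothetical. Each panelist estimates, for every item, the probability that a student just at the borderline of a level would answer it correctly; the sum of those probabilities is that panelist's recommended cut score, and the panel's recommendations are then averaged.

```python
# Hypothetical Angoff-style computation; panelists, items, and ratings are invented.
from statistics import mean

# Each row: one panelist's estimated probabilities that a "borderline Proficient"
# student would answer each of five items correctly.
panelist_ratings = [
    [0.80, 0.65, 0.55, 0.70, 0.40],
    [0.75, 0.60, 0.50, 0.65, 0.45],
    [0.85, 0.70, 0.60, 0.75, 0.50],
]

# A panelist's recommended raw cut score is the sum of his or her item ratings,
# i.e., the expected number-correct score for a borderline student.
recommended = [round(sum(ratings), 2) for ratings in panelist_ratings]

# The panel's cut score is typically a central tendency of the recommendations,
# which would then be mapped onto the reporting scale.
raw_cut = mean(recommended)
print("Per-panelist recommendations:", recommended)  # [3.1, 2.95, 3.4]
print("Panel raw cut score:", round(raw_cut, 2))     # 3.15 expected correct out of 5
```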
The achievement-level descriptions and corresponding cut scores are intended to reflect performance as defined through a subject-area framework (see below). Each assessment is built around an organizing framework, which is the blueprint that guides the development of the assessment instrument and determines the content to be assessed.
For NAEP, the frameworks capture a range of subject-specific content and
thinking skills needed by students to deal with what they encounter, both inside and outside their classrooms. The NAEP frameworks are determined through a development process designed to ensure that they are appropriate for current educational requirements. Because the assessments must remain flexible to mirror changes in educational objectives and curricula, the frameworks are designed to be forward-looking and responsive,
balancing current teaching practices with research findings.2
There is an assumed relationship between a subject-area framework and other elements of the assessment. First, given that the stated purpose of a framework is to guide the development of items for the assessment, there must be a close correspondence between a framework and the assessment items.3
Second, given that NAGB intends the achievement levels and the frameworks to remain stable over several test administrations (while actual items will change with each administration), the frameworks must serve as the pivotal link between the assessments and the achievement levels—both as they are described narratively and as they are
operationalized into item-level judgments and, ultimately, cut scores. Ideally, this would mean that achievement-level descriptors are crafted concurrently with the framework (see Shepard et al., 1993, p. 47; also see Bejar et al., 2008).
In principle, then, the achievement-level descriptions should guide test
development so that the tests are well aligned to the intended constructs (concepts or characteristics) of interest. The achievement-level descriptors guide standard setting so that panelists can operationalize them in terms of cut scores with the same
2 For NAEP's description of the frameworks, see http://nces.ed.gov/nationsreportcard/frameworks.aspx [January 2016]. Item specifications for reading can be found at https://www.nagb.org/publications/frameworks/reading/2009-reading-specification.html [August 2016]. Item specifications for mathematics can be found at https://www.nagb.org/publications/frameworks/mathematics/2009-mathematics-specification.html [August 2016].
3 For each assessment, the framework is supplemented with a set of Assessment Specifications that provide additional details about developing and assembling a form of the test.
conceptualization used by item writers. To the extent that tests are well aligned to the achievement-level descriptors, the latter reflect the knowledge and skills possessed by the students at or above each cut score. Therefore, the descriptors used in score reporting actually represent the observed skills of students in a particular achievement-level category (Egan et al., 2012; Reckase, 2001). Figure 1-1 shows the assumed relationships.
KEY FEATURES OF NAEP
NAEP consists of two different programs: main NAEP and long-term trend NAEP. Main NAEP is adjusted as needed so that it reflects current thinking about content areas, assessment strategies, and policy priorities, but efforts are made to incorporate these changes in ways that do not disrupt the trend line. Main NAEP is the subject of this report. Long-term trend NAEP, in contrast, provides a way to examine achievement trends with a measure that does not change; these assessments do not incorporate advances in content areas or assessment strategies. Both main and long-term trend NAEP differ fundamentally from other testing programs in that their objective is to obtain accurate measures of academic achievement for groups of students rather than for individuals. This goal is achieved using innovative sampling, scaling, and analytic procedures.
Sampling of Students4

With the introduction of the TUDA (Trial Urban District Assessment), a sampling plan was created for each participating urban district. At that point, for the states with TUDAs, the state sample was augmented by the TUDA sample. However, the two could be separated for analysis purposes. The national samples for NAEP are selected using stratified multistage sampling designs with three stages of selection: districts, schools, and students. The result is a sample of about 150,000 students sampled from 2,000 schools. The sampling design for state NAEP has only two stages of selection: schools and students within schools. The results are samples of approximately 3,000 students in 100 schools per state (roughly 100,000 students in 4,000 schools nationwide).
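A rough sketch of the two-stage selection described for state NAEP (schools first, then students within schools) is shown below. The school frame, sample sizes, and use of simple random sampling at each stage are simplifying assumptions made for illustration; the actual NAEP design stratifies schools and selects them with probability proportional to size.

```python
# Simplified two-stage sample: schools first, then students within schools.
# The actual NAEP design stratifies schools and samples with probability
# proportional to size; simple random sampling is used here only to show the stages.
import random

random.seed(1)

# Hypothetical frame: 400 schools, each with a roster of student IDs.
frame = {f"school_{i}": [f"s{i}_{j}" for j in range(random.randint(40, 400))]
         for i in range(400)}

N_SCHOOLS = 100            # first stage: sample schools
STUDENTS_PER_SCHOOL = 30   # second stage: sample students within each school

sampled_schools = random.sample(list(frame), N_SCHOOLS)
sample = {school: random.sample(frame[school],
                                min(STUDENTS_PER_SCHOOL, len(frame[school])))
          for school in sampled_schools}

print(sum(len(v) for v in sample.values()), "students sampled from", len(sample), "schools")
```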
For the national assessment in 1992, approximately 26,000 4th-, 8th-, and 12th-grade students in 1,500 public and private schools across the country participated. For jurisdictions that participated in the separate state program, approximately 2,500 students were sampled from approximately 100 public schools for each grade and curriculum area. Thus, a total of approximately 220,000 4th- and 8th-grade students who were attending nearly 9,000 public schools participated in the 1992 trial state assessments. In 1996, between 3,500 and 5,500 students were tested in mathematics and science and between 4,500 and 5,500 were tested in reading and writing (Campbell et al., 1997).

4 This section describes the sampling design for the math and reading assessments only.
Sampling of Items
NAEP assesses a cross-section of the content within a subject-matter area. Due to the large number of content areas and subareas within them, NAEP uses a matrix sampling design to assess student achievement in each subject. Using this design, different blocks of items drawn from the overall content domain are administered to different groups of students: this approach makes it possible to administer a large number and range of items while keeping individual testing time to 1 hour for all subjects. Students receive different but overlapping sets of NAEP items using a form of matrix subsampling known as balanced incomplete block spiraling. This design requires highly complicated analyses and does not permit the performance of a particular student to be accurately measured. Therefore, NAEP reports only group-level results; individual results are not reported.
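The sketch below conveys the flavor of the block-assignment idea: items are grouped into blocks, each booklet contains only a few blocks, and the booklets are arranged so that every pair of blocks appears together in some booklet before being rotated ("spiraled") across students. The seven-block layout and labels are a toy balanced incomplete block design, not the actual NAEP booklet design.

```python
# Toy illustration of matrix sampling with balanced incomplete blocks.
# Seven item blocks (A-G) are arranged into seven booklets of three blocks each
# so that every pair of blocks appears together in exactly one booklet
# (a classic balanced incomplete block design); booklets are then "spiraled"
# across students in rotation. Block labels and counts are hypothetical.
from itertools import combinations

BOOKLETS = [
    ("A", "B", "D"), ("B", "C", "E"), ("C", "D", "F"), ("D", "E", "G"),
    ("E", "F", "A"), ("F", "G", "B"), ("G", "A", "C"),
]

# Check the balance property: each pair of blocks co-occurs exactly once.
pair_counts = {pair: 0 for pair in combinations("ABCDEFG", 2)}
for booklet in BOOKLETS:
    for pair in combinations(sorted(booklet), 2):
        pair_counts[pair] += 1
assert set(pair_counts.values()) == {1}

def assign_booklet(student_index: int) -> tuple:
    """Spiral booklets across students in a fixed rotation."""
    return BOOKLETS[student_index % len(BOOKLETS)]

print([assign_booklet(i) for i in range(3)])
```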
Analytic Procedures
Although individual results are not reported, it is possible to compute estimates of individuals' performance on the overall assessment using complex statistical procedures. The observed data reflect student performance over the particular NAEP blocks each student actually took. Since no individual takes all NAEP blocks, statistical estimation procedures are used to derive estimates of individuals' proficiency on the full complement of skills and content covered by the assessment, on the basis of the test blocks that an individual took.
The procedure involves combining samples of values drawn from distributions of possible proficiency estimates for each student. These individual student distributions are estimated from their responses to the test items and from background variables. The use of background variables in estimating proficiency is called conditioning. For each student, five values, called plausible values, are randomly drawn from the student's distribution of possible proficiency estimates.5 The plausible values are intended to reflect the uncertainty in each student's proficiency estimate, given the limited set of test questions taken by each student. The sampling from the student's distribution is an application of Rubin's (1987) multiple imputation method for handling missing data (the responses to items not presented to the student are considered missing). In the NAEP context, this process is called plausible values methodology (National Research Council, 1999).

5 Beginning with the 2013 analysis, the number of plausible values was increased to twenty.
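As a rough illustration of how plausible values enter an analysis, the sketch below estimates a group mean from five plausible values per student and combines the estimates with Rubin's multiple-imputation rules, so that the between-imputation variance captures measurement uncertainty. The data are simulated, and the within-imputation variance uses a simple random-sample formula; real NAEP analyses also incorporate sampling weights and jackknife variance estimation for the complex sample design.

```python
# Illustrative use of plausible values with Rubin's (1987) combining rules.
# Simulated data; real NAEP analyses also use sampling weights and replicate
# (jackknife) variance estimation for the complex sample design.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_pv = 1000, 5

# Hypothetical plausible values: five draws per student from a posterior
# proficiency distribution (here, true score plus measurement noise).
true_scores = rng.normal(250, 35, size=n_students)
plausible_values = true_scores[:, None] + rng.normal(0, 20, size=(n_students, n_pv))

# Step 1: compute the statistic of interest (a group mean) once per plausible-value set.
means = plausible_values.mean(axis=0)                       # one estimate per plausible value
within = plausible_values.var(axis=0, ddof=1) / n_students  # sampling variance of each estimate

# Step 2: combine with Rubin's rules.
point_estimate = means.mean()
within_var = within.mean()                                  # average within-imputation variance
between_var = means.var(ddof=1)                             # variance across the five estimates
total_var = within_var + (1 + 1 / n_pv) * between_var
print(f"Estimated mean: {point_estimate:.1f}  (SE {np.sqrt(total_var):.2f})")
```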
Statistics Reported
NAEP currently reports student performance on the assessments using a scale that ranges from 0 to 500 for 4th- and 8th-grade mathematics and for all grades for reading. Originally, 12th-grade mathematics results were also reported on this scale; the scale was changed to a range of 0 to 300 when the framework was revised in 2004-2005 (see Chapter 5). Scale scores summarize performance in a given subject area for the nation as
a whole, for individual states, and for subsets of the population defined by demographic and background characteristics. Results are tabulated over time to provide trend information.

As described above, NAEP also reports performance using achievement levels.
NAEP collects a variety of demographic, background, and contextual information from students, teachers, and administrators. Student demographic and academic information includes such characteristics as race and ethnicity, gender, highest level of parental education, and status as a student with disabilities or an English-language learner. Contextual and environmental data provide information about students' course selection, homework habits, use of textbooks and computers, and communication with parents about schoolwork. Information obtained from teachers includes the training they received, the number of years they have taught, and the instructional practices they use. Information obtained from administrators covers their schools, including the location and type of school, school enrollment numbers, and levels of parental involvement. NAEP summarizes achievement results by these various characteristics.6
COMMITTEE APPROACH
The committee held three in-person meetings and four half-day virtual meetings during 2015. The first two meetings included public sessions at which the committee gathered a great deal of information. At the first meeting, officials from NAGB and the National Center for Education Statistics (NCES) described their goals for the project and discussed the types of information available to the committee. At the second meeting, a half-day public forum provided an opportunity for people to talk about how they interpret and use the reported achievement levels. This public forum was organized as five panel discussions, each focused on a type of audience for NAEP results: journalists and education writers; state policy users; developers of the assessments designed to be aligned with the Common Core State Standards; research and advocacy groups; and a synthesis panel with two experts in standard setting (see Appendix A for the forum agenda).
We designed our evaluation accordingly. It is important to point out that the four factors—reasonable, reliable, valid, and informative—are interrelated and cannot be evaluated in isolation from each other. In addition, they are connected by the underlying purpose(s) of achievement-level reporting: that is, what are the intended uses of
6 Results from the most recent tests can be found at
http://nces.ed.gov/nationsreportcard/about/current.aspx [February 2016].
achievement levels? For example, in making judgments about reliability, one needs to consider what types of inferences will be made and the decisions and consequences that will result from those inferences. The same is true for validity: valid for what inferences, what interpretations, and what uses?
Thus, the committee chose not to organize its work and this report around each factor separately. Instead, we organized our review around three types of evidence to support the validity and use of achievement-level reporting: the process of setting standards; the outcomes of the standard setting (the cut scores and the achievement-level descriptors); and the interpretations and uses of achievement levels. We identified a set
of questions to guide our data collection for each factor, and we considered this
information in relation to the Standards for Educational and Psychological Testing
(American Educational Research Association et al.; hereafter, Standards), both the most current version (2014) and the version in use at the time the achievement levels were developed (1985). Box 1-2 lists the questions we posed for our work.
Data Sources
To address its charge, the committee collected and synthesized a wide range of information, including the results of complex statistical analyses, reports of NAEP results, and commentary about them. We used the following types of information:
general (publicly available) background information about NAEP, such as its purpose and intended uses, reading and mathematics frameworks, item pools and test specifications, item types, and reporting mechanisms;
reports of results of NAEP assessments of reading and mathematics and the corresponding interpretation guidance;
interactive reporting tools available on the NAEP websites;
historical documentation of the policy steps involved in creating the
achievement levels for reading and mathematics;
technical documentation of standard setting procedures (setting the cut scores, developing detailed achievement-level descriptors), and associated analyses;
reports from prior external evaluations of the extent to which the achievement levels are reliable, valid, reasonable, and informative to the public;
reports from internal evaluations (generally by or for NAGB) of the extent to which the achievement levels are reliable, valid, reasonable, and informative
to the public, along with responses (and actions taken) to the criticisms made
in the external evaluations;
empirical studies of the reliability, validity, usefulness, and “informativeness”
of the achievement levels and achievement-level reporting, including studies
of the effects of using various procedures for establishing the cut scores;
professional standards with regard to standard setting (e.g., those of the
American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education);
edited volumes and other scholarly literature from well-known experts in the field of standard setting, summarizing research and empirical studies and offering insights and advice on best practice;
subjective accounts about achievement-level reporting, including opinion pieces, commentaries, newspaper columns and articles, blogs, tweets,
conference and workshop presentations, and public forums; and
other reports prepared by research and policy organizations about specific aspects of achievement-level reporting for NAEP.
Our use of these data sources varied. When drawing conclusions from the empirical evidence, we gave precedence to results from analyses published in peer-reviewed journals. However, as is often the case with evaluations, these types of reports were scarce. To compensate for this limitation, the committee placed the greatest weight on evidence that could be corroborated through multiple sources.
There have been numerous literature reviews summarizing the results of these studies and offering advice on best practices. In order to address our charge within the allotted time period, we reviewed several compendia and edited volumes by well-respected experts in the field of standard setting. These documents are well known in the measurement field, the editors and authors are widely respected, and the various chapters cover an array of views. Specifically, we relied on the following documents, listed in chronological order:
a special issue of the Journal of Educational Measurement (National Council on Measurement in Education, 1978)
Jaeger (1989)
a special issue of Educational Measurement: Issues and Practice (Nitko, 1991)
Crocker and Zieky (1995a, 1995b)
Bourque (1997) and Bourque and Byrd (2000)
The Department of Education informed the committee that there were three other reports that had not been provided to us. The reports document studies of the extent to which the item pool and the achievement-level descriptors are aligned and the resulting changes made to improve alignment. One study was conducted in 2003 (Braswell and Haberstroh, 2004); two others were conducted in 2009, when changes were made to the mathematics framework for grade 12 and the reading framework for all grades (4, 8, and 12) (Pitoniak et al., 2010; Donahue et al., 2010).7 Because of this new information, the committee needed additional time to analyze the reports and incorporate conclusions from them in our report. We indicate in Chapters 5, 7, and 8 where the text was changed after the original transmittal to reflect the new information.
7 Copies of these studies can be obtained from the Public Access Records Office by phone (202-334-3543) or email (paro@nas.edu).

The Department of Education also noted several places in the report at which there were small factual errors. Several recommendations were reworded because they misstated the agency responsible for a proposed action. These changes were reviewed in accordance with institutional policy and are noted in the Summary and Chapter 8. We added a brief description of the item-rating process during the standard setting, which is noted in Chapter 3. And, as a result of some of these other changes, we added a new conclusion (7-1), which is noted in the Summary and Chapters 7 and 8.
The late information received from the Department of Education has added to the richness of the information in this report. However, it has not in any way changed the committee's fundamental conclusions and recommendations.
Guide to the Report
The report is organized around our evaluation questions. Chapter 2 provides additional context about the origin of achievement levels for NAEP, the process for determining them, and the evaluations of them. It also discusses changes in the field of standard setting over the past 25 years. Chapter 3 discusses the evidence we collected on the process for setting achievement levels.
Chapters 4 and 5, respectively, consider the evidence of the reliability and validity of the achievement levels. Chapter 6 discusses the intended uses of the achievement levels and the evidence that supports those uses, along with the actual uses of the levels and common misinterpretations.
In Chapter 7 we explore the issues to consider in deciding whether a new standard setting is needed, and we present our recommendations in Chapter 8.
BOX 1-1 Key Terms in Standard Setting
Unless specifically noted, the definitions below are excerpted and adapted from
Standards for Educational and Psychological Testing (American Educational Research
Association, American Psychological Association, and National Council on Measurement
in Education, 2014).
Achievement level (also called performance level, proficiency level, performance
standard): Label or brief statement classifying a test taker's competency in a particular domain, usually defined by a range of scores on a test. For example, labels such as basic to advanced or novice to expert constitute broad ranges for classifying proficiency (pp. 215, 221).
Achievement level descriptor (also called performance level descriptor, proficiency
level descriptor; descriptor and description are often used interchangeably): Qualitative descriptions of test takers' levels of competency in a particular area of knowledge or skill, usually defined in terms of categories ordered on a continuum. The categories constitute broad ranges for classifying test takers' performance (p. 215).
Claim: A statement (inference, interpretation) made about students’ knowledge and skills
based on evidence from test performance (Kane, 2006, p. 27).
Content domain: The set of behaviors, knowledge, skills, abilities, attitudes, or other
characteristics to be measured by a test, represented in detailed test specifications and often organized into categories by which items are classified (p. 218).
Construct: The concept or characteristic that a test is designed to measure, such as
mathematics or reading achievement (p. 217).
Cut score: A specific point on a score scale, such that scores at or above that point are
reported, interpreted, or acted upon differently from scores below that point (p. 218).
Framework: A detailed description of the construct to be assessed, delineating the scope
of the knowledge, skills, and abilities that are represented. The framework includes the blueprint for the test that is used to guide the development of the questions and tasks, determine response formats (multiple-choice, constructed response), and scoring procedures (p. 11).
Performance standard: Statements of what test takers at different performance levels
know and can do, along with cut scores or ranges of scores on the scale of an assessment that differentiate levels of performance. A performance standard consists of a cut score and a descriptive statement (p. 221).
Standard setting: The process, often judgment based, of setting cut scores using a structured procedure that seeks to map test scores into discrete performance levels that are usually specified by performance-level descriptors (p. 224).
SOURCE: Adapted from American Educational Research Association et al. (2014).
BOX 1-2 Evaluation Questions
1. Why was achievement-level reporting implemented? What was it intended to
accomplish? How are achievement levels intended to be interpreted and used? What inferences are they intended to support? What comparisons are appropriate?
2. What validity evidence exists that demonstrates these interpretations and uses are supportable?
a. What evidence exists to document that the achievement-level-setting process was handled in a way likely to produce results that support the intended
interpretations and uses?
b. What evidence exists to document that the achievement levels represent the content and skills they were intended to represent (content- or construct-based validity)? Was the process handled in a way likely to produce consensus among the panelists? To what extent is there congruence between the
frameworks, test questions, achievement-level descriptors, and cut scores?
c. What evidence exists to document that the cut scores are appropriate, neither too high nor too low? To what extent do relationships with other variables suggest that the cut scores are appropriate and support the intended interpretations and uses (criterion-related validity)?
3. Was the overall process for determining achievement levels—their descriptions, the designated levels (basic, proficient, advanced), and cut scores—reasonable and
sensible? Did it follow generally accepted procedures (at the time the achievement levels were set and also according to the current state of the field and knowledge base)?
4. Did the process yield a reasonable set of cut scores? Do the score distributions (the percentage at each achievement level) seem reasonable, given what is known about student achievement from other sources? Are the higher levels (proficient, advanced) attainable?
5. What questions do stakeholders want and need NAEP to answer? Do achievement-level reports respond to these wants and needs? Do achievement-level reports respond to these wants and needs better than reports on other metrics (e.g., summaries of scale scores)?
6. What are common interpretations and uses? What are common misinterpretations and misuses? What guidance is provided to help users interpret achievement-level results?