Alignment of Standards, Large-scale Assessments, and Curriculum:
A Review of the Methodological and Empirical Literature
Meagan Karvonen, Western Carolina University
Shawnee Wakeman and Claudia Flowers, University of North Carolina at Charlotte
Support for this research is provided by the National Alternate Assessment Center (www.naacpartners.org), a five-year project funded by the U.S. Department of Education, Office of Special Education Programs (No. H324U040001). The NAAC represents a collaborative effort between the University of Kentucky, the University of North Carolina at Charlotte (UNCC), the National Center on Educational Outcomes (NCEO), the Center for Applied Special Technology (CAST), and the University of Illinois at Urbana-Champaign. The opinions expressed do not necessarily reflect the position or policy of the Department of Education, and no official endorsement should be inferred.
Abstract

The purpose of this study was to provide a comprehensive review of the literature on the alignment of academic content standards, large-scale assessments, and curriculum. After reviewing the characteristics of 195 identified resources on alignment published between 1984 and 2005, this review focused primarily on (1) a comparison of features of alignment models and their methodologies, and (2) a narrative and quantitative analysis of characteristics of 67 empirical alignment studies. Based on this review, several recommendations for further research and improvements in alignment technology were made.
Alignment of Standards, Large-scale Assessments, and Curriculum:
A Review of the Methodological and Empirical Literature

The educational community sometimes assumes that instructional systems are driven by content standards, which are translated into assessment, curriculum materials, instruction, and professional development. Research has shown that teachers may understand what content is wanted and believe they are teaching that content, when in fact they are not (Cohen, 1990; Porter, 2002). Improvements in student learning depend on how well assessment, curriculum, and instruction are aligned with and reinforce a common set of learning goals, and on whether instruction shifts in response to the information gained from assessments (National Research Council, 2001). Alignment is often difficult to achieve because educational decisions are frequently made at different levels of the educational agency. For example, states may have one set of experts who develop written standards, a second set who develop the assessment, and a third set who train teachers in standards-based instruction. Finally, it is teachers who translate academic standards into instruction.
In 1994, the Improving America’s Schools Act and Title I of the Elementary and Secondary Education Act required states to set high expectations for student learning, to develop assessments that measure those expectations, and to create systems that hold educators accountable for student achievement. The No Child Left Behind Act (2002) reiterated this emphasis on quality assessment of student achievement; final NCLB regulations require that states’ assessment systems “address the depth and breadth of the State’s academic content standards; are valid, reliable, and of high technical quality; and express results in terms of the State’s academic achievement standards” (55 Fed. Reg. 45038, emphasis added).
NCLB peer review guidance (U.S. Department of Education, 2004) indicates that judgments about the compliance of states’ assessment systems with Title I requirements will be made based on evidence submitted by states (e.g., alignment studies) rather than the assessments themselves. The Guidance further recommends that states ensure that their assessments:
o Cover the full range of content specified in the State’s academic content standards, meaning that all of the standards are represented legitimately in the assessments; and
o Measure both the content (what students know) and the process (what students can do) aspects of the academic content standards; and
o Reflect the same degree and pattern of emphasis apparent in the academic content standards (e.g., if the academic content standards place a lot of emphasis on operations then so should the assessments); and
o Reflect the full range of cognitive complexity and level of difficulty of the concepts and processes described, and depth represented, in the State’s academic content standards, meaning that the assessments are as demanding as the standards; and
o Yield results that represent all achievement levels specified in the State’s academic achievement standards. (U.S. Department of Education, 2004, p. 41)
These issues should be considered in the alignment of the state’s entire assessment system, including assessments for students with disabilities and English language learners. Low-complexity methods, such as simply mapping assessment items back to state content standards, are insufficient for peer review purposes (U.S. Department of Education, 2004, p. 41).
Alignment can be formally defined as the degree of agreement, overlap, or intersection between standards, instruction, and assessments. In other words, alignment is the match between the written, taught, and tested curriculum (Flowers, Browder, Ahlgrim-Delzell, & Spooner, in press). Accurate inferences about student achievement and growth over time can only be made when there is alignment between the standards (expectations) and assessments. From this perspective, alignment has both content and consequential validity implications (Bhola, Impara, & Buckendahl, 2003; La Marca, Redfield, Winter, Bailey, & Despriet, 2000).
The consequences of poorly aligned standards, assessments, and curriculum are potentially significant for students and educational systems. Aligning curriculum with assessments can result in improved test scores for students regardless of background variables such as socioeconomic status, race, and gender. In contrast, misalignment may reinforce differences among students based on their sociocultural backgrounds, as those with more exposure to educational opportunities in their everyday lives may still perform well when tests measure content that is not taught in the classroom (English & Steffy, 2001). Strong evidence of alignment between assessments and state standards supports the validity of interpretations made about test scores.
For many years, states and test developers have relied on content experts and other item reviewers to make judgments about whether test items reflect the content of particular strands within state content standards. The AERA position statement on high-stakes testing calls for alignment of assessments and curriculum on the basis of both content and cognitive processes (AERA, 2000). Bhola et al. (2003) emphasized the need to use more complex methods for examining alignment that go beyond content and cognitive process at the item level. La Marca et al. (2000) reviewed and synthesized conceptualizations of alignment and methods for analyzing the alignment between standards and assessment. They identified five dimensions that should be considered, based largely on Webb’s (1999) work:
1. Content match, or the correspondence of topics and ideas in the standards and the assessment;
2. Depth match, or the level of cognitive complexity required to demonstrate knowledge and transfer it to different contexts;
3. Relative emphasis on certain types of knowledge tasks in the standards and the assessment system;
4. Match between the assessment and standards in terms of performance expectations; and
5. Accessibility of the assessment and standards, so that both are challenging for all students yet also fair to students at all achievement levels.
The emphasis in this study is on the methodologies used to empirically investigate alignment, and on the existing empirical evidence that might indicate what degree of alignment has been achieved in large-scale assessment systems. In addition to the focus on alignment of standards and assessments emphasized by La Marca et al. (2000) and Webb (1999), this study examines the alignment of standards and assessments with the curriculum taught in schools. This review and synthesis of the literature is intended to yield information about gaps in methodological approaches to examining alignment, as well as areas in which additional empirical investigations are needed to establish sound criteria for judging the quality of alignment.
Methods
This section describes the literature search and identification procedures, primary and secondary coding procedures, and data analysis strategies.
Literature Search and Identification Procedures
Cooper (1989) warned against overly narrow problem formulations in the early stages of a literature review, as limited conceptual breadth poses a threat to the validity of the study. Thus, the scope of the literature search was initially very broad. Literature written between 1984 and 2005 that had a primary focus on alignment was the target of the search. The scope of alignment included measures between (1) assessment and curriculum/instruction, (2) assessment and content standards, (3) content standards and curriculum/instruction, (4) instruction and instructional materials, (5) two types of standards, and (6) a combination of assessment, content standards, and curriculum/instruction. Assessments included both general and special education instruments that were either objective or alternative (e.g., performance-based, portfolio). Classroom and district-level assessments were excluded from this study, but alignment in higher education settings was included. Studies on alignment based on standards at any level (e.g., district, state) were included.
A total of 28 terms or combinations of terms were used to define the research base of alignment resources (e.g., sequential development; alignment and curriculum; accountability, alignment, and assessment). Electronic and print resources were used to identify materials for possible inclusion. Electronic databases searched included InfoTrac, Google, ERIC, PsycINFO, Academic Search Elite, Books in Print, and Dissertation Abstracts. The websites of assessment organizations (e.g., Harcourt, Measured Progress, Buros Institute for Assessment Consultation and Outreach), technical assistance centers (e.g., National Center on Educational Outcomes), educational organizations (e.g., Council of Chief State School Officers; National Center for Research on Evaluation, Standards, and Student Testing [CRESST]), and state education agencies were also searched for nonpublished alignment material. As some searches identified a very large number of potential hits (e.g., Google identified 5,690,000 hits for alignment and assessment), the first 150 of those documents were reviewed for potential inclusion. The reference lists of identified books and several seminal and recent works (e.g., Bhola, Impara, & Buckendahl, 2003; Case, Jorgensen, & Zucker, 2004; La Marca et al., 2000; Webb, 1997) were also searched. Contacts with authors were made when identified materials could not be located. Finally, a follow-up list of prominent authors (e.g., Andrew Porter, Robert Rothman, John Smithson, Norman Webb) and model names (e.g., Surveys of Enacted Curriculum, Achieve, Council for Basic Education) was also searched in Google to ensure complete coverage of the reference material.
Conceptual relevance of each source identified in the literature search was determined by the study coordinator, who applied the inclusion criteria liberally during the first round of literature identification. Resources that were of questionable relevance were reviewed by a second author.
Coding Procedures
Initial coding was done on the entire set of identified documents in order to broadly characterize the nature of the alignment literature. A secondary coding scheme was applied to the empirical resources.
Initial coding procedures. Identified material was entered into a database by reference and was coded according to three categories: (a) elements being aligned (as described above), (b) type of document, and (c) purpose or focus of document.
The type of document was defined by five categories. Literature was coded as a report if it was written as a non-published paper, technical report, dissertation, or brief. Presentations included all papers or multimedia work presented to an audience. Journal articles were published works found in a journal or newsletter format. Books included any chapters in edited works or manuals disseminated by states. Finally, other included all training materials, web pages dedicated to alignment, and other relevant alignment work (e.g., state documents that discussed alignment but did not include any empirical data or methodological descriptions).
The purpose or focus of the document was coded into six groups. Conceptual included literature that defined alignment, discussed the relationships among standards, assessment, and curriculum, discussed reasons for alignment, or argued the benefits of well-aligned systems or the drawbacks of poorly aligned systems. Resources that described a model or method for conducting alignment studies were coded as methodological. Literature that focused on recommendations for policy about alignment was coded as policy. Documents that included data collection procedures and results from an original alignment study were coded as empirical. Review/synthesis was coded for materials that described more than one primary source on alignment. Finally, other was coded for miscellaneous foci that did not fit other categories (e.g., state descriptions of alignment without methodological or empirical components; instances where rubrics or test blueprints were used to examine alignment).
Interrater reliability was obtained for each coded category. Two researchers coded a sample of 80 documents (41%) to obtain interrater reliability. A point-by-point method (the number of agreements for occurrences and non-occurrences divided by the total number of points, multiplied by 100) was used to calculate reliability. The average reliability for type of alignment was 88% (range of 50%-100% agreement). As only two documents were identified as addressing alignment between standards and curriculum/instruction, the reliability percentage of 50% reflects a single disagreement. The median agreement was 91%. The average reliability for type of document was 100%. The average reliability for purpose or focus of document was 90% (range of 50%-100%). Policy was identified as the focus of six documents by one researcher and of three by the other, resulting in a 50% agreement rate. The median was 96%. Consensus was reached on all disagreements across all categories.
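The point-by-point formula described above can be computed directly from two coders' labels. The following is a minimal sketch; the function name and the ten hypothetical document codes are illustrative, not taken from the study's data:

```python
def point_by_point_agreement(coder_a, coder_b):
    """Point-by-point interrater agreement: agreements on occurrences
    and non-occurrences, divided by total points, multiplied by 100."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must rate the same documents")
    agreements = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * agreements / len(coder_a)

# Hypothetical type-of-document codes for ten resources
a = ["report", "journal", "report", "book", "other",
     "report", "journal", "presentation", "report", "book"]
b = ["report", "journal", "report", "book", "report",
     "report", "journal", "presentation", "report", "book"]
print(point_by_point_agreement(a, b))  # 90.0 (9 of 10 codes agree)
```

With this metric, a single disagreement in a two-document category yields 50% agreement, which is why the authors report the median alongside the average.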
Secondary coding procedures. Using a coding form developed by the first author, two researchers summarized information about the resources identified in the first phase as empirical studies. Categorical data were recorded for type of literature; content area(s) and grade levels; elements of the educational system aligned; descriptions of the types of standards, assessment, and instructional indicators; alignment methodology used; and entity that conducted the alignment study. The second author coded three resources with a second coder for training purposes; both coders then coded three additional resources and compared codes before the second coder coded the remaining empirical studies independently. Reliability on the secondary coding was 93%, based on a sample of 11 resources (16% of the empirical literature). One researcher entered data into SPSS and cleaned the database prior to analysis.
Data Analysis Strategies
Descriptive statistics were calculated on all primary codes for the entire set of literature, and on the secondary codes for the subset of empirical literature. Frequencies were also calculated for key characteristics of alignment studies, by alignment methodology. Narrative descriptions of some articles were provided to illustrate certain points about the literature.
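Frequency counts of this kind (overall, and for a key characteristic within each alignment methodology) are straightforward to tabulate; a small sketch with hypothetical coded records, not the study's actual data:

```python
from collections import Counter

# Hypothetical coded records: (alignment methodology, content area)
studies = [
    ("Webb", "math"), ("Webb", "reading"), ("SEC", "math"),
    ("Achieve", "math"), ("Webb", "math"), ("SEC", "science"),
]

# Frequency of studies by methodology (a primary code)
by_method = Counter(method for method, _ in studies)

# Frequency of a key characteristic within each methodology
by_method_and_area = Counter(studies)

print(by_method["Webb"])                     # 3
print(by_method_and_area[("Webb", "math")])  # 2
```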
Results
In this section, characteristics of the identified alignment literature are first described. Then a subset of the literature (methodological and empirical) is analyzed to address the following points:
1. A comparison of features of alignment models and their methodologies, and
2. A narrative and quantitative analysis of results from empirical alignment studies.
Characteristics of the Literature
A total of 195 resources were identified during the search, nearly half of which were reports (47%). Other documents included journal articles (21%), presentation slides or papers (14%), and books and other resources (18%). Roughly one-third of the resources (33.4%) were empirical, while 13.8% were conceptual, 13.8% were methodological, 9.2% were descriptive or reviews, and 4.6% had a policy emphasis. Twenty-four percent of resources had other foci, such as state descriptions or webpages about alignment issues (n = 8), state reports described above (n = 25), and documents (from states and individual authors) for professional development. (Each resource could have more than one emphasis.)
One hundred fifty-seven of the resources had identifiable publication dates. Six of those resources (3.8%) were published between 1984 and 1990, and another six (3.8%) were published between 1991 and 1995. Between 1996 and 2000, 25 additional resources (15.9%) were published. The number of resources published per year began increasing dramatically after 2000; between 2001 and 2005, 120 alignment resources (76.4%) were published. Empirical studies were the most frequently identified category across all year spans, increasing from 4 in 1984-1990 to 43 in 2001-2005.
A diverse range of documents was identified and included in the review. Research monographs and conceptual articles frequently cited in the alignment literature (e.g., Bhola, Impara, & Buckendahl, 2003; Webb, 1997) were one common type. Alignment reports, such as those published by organizations conducting alignment studies (cf. Achieve, 2001), were another common type. PowerPoint presentations from conferences or meetings (e.g., Potter, 2002), books and book chapters (English & Steffy, 2001), ERIC documents (e.g., Madfes & Muench, 1999), and dissertations (Moahi, 2004) were also identified. There were several types of documents published by states. One type was state websites that defined alignment (e.g., Florida Department of Education, n.d.) or could serve as a resource for teachers or districts (e.g., Oregon Department of Education, n.d.). States also published test blueprints (e.g., Oklahoma Department of Education, n.d.) and reports for peer review (North Dakota Department of Public Instruction, 2003).
Alignment Models and Methods
Bhola et al. (2003) reviewed existing alignment methodologies and characterized them according to their level of complexity. Expert review of the content represented in state standards and on assessments to identify item-level matches would be described as a low-complexity method. At the other end of the spectrum is Webb’s (1999) approach, which includes several indicators of alignment at both the item and test levels. The three moderate- and high-complexity methods used in the empirical literature are briefly described here.
Achieve. In the Achieve model, the four dimensions for examining the degree of alignment between an assessment and standards are (a) content centrality, (b) performance centrality, (c) challenge, and (d) balance and range (Resnick, Rothman, Slattery, & Vranek, 2003). Content centrality examines the quality of the match between the content of each test question and the content of the related standards. After a senior reviewer has matched test items to the test blueprint, reviewers examine each item (in the blueprint) to determine whether it assesses the academic content well, partially, or not at all. These judgments go deeper than the one-to-one correspondence used in the blueprint. Performance centrality focuses on the degree of match between the type of performance (cognitive demand) presented by each test item and the type of performance (e.g., select, identify, compare, analyze, represent, use) described by the related standard. Reviewers analyze each test item to determine whether the type of performance the item requires matches the demand expected by the standard, and whether it does so well, partially, or not at all. The criterion called challenge is applied to a set of items to determine whether doing well on the set requires students to master challenging subject matter. Reviewers consider two factors in evaluating sets of test items against the challenge criterion: source of challenge and level of challenge. Source of challenge attempts to uncover whether the individual test items in a set are difficult because of the knowledge and skills they target, or for other reasons not related to the subject matter, such as relying unfairly on students’ background knowledge. Level of challenge compares the emphasis of performance required by a set of items to the emphasis of performance described by the related standards. Reviewers also judge whether the set of items has a span of difficulty appropriate for students at a given grade. Finally, tests must cover the full range of standards with an appropriate balance of emphasis across the standards. Evaluating balance and range provides both qualitative and quantitative descriptive information about the choices that test developers have made. Balance investigates whether there are enough items to measure a content strand and, if so, whether the items in a set focus on only a subset of that strand. Range is a measure of coverage or breadth (the numerical proportion of all content addressed).
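The Achieve reports present range as a numerical proportion of content addressed. As a rough sketch of that computation (the item-to-objective mapping and function name below are hypothetical, not drawn from any cited study):

```python
def range_index(item_hits, all_objectives):
    """Range (breadth): the proportion of objectives in the standards
    that are addressed by at least one test item."""
    covered = set()
    for objectives in item_hits.values():
        covered.update(objectives)
    return len(covered & set(all_objectives)) / len(all_objectives)

# Hypothetical mapping of test items to the objectives they measure
item_hits = {
    "item1": {"length"},
    "item2": {"area", "volume"},
    "item3": {"length"},
    "item4": set(),  # an item matched to no objective in this strand
}
objectives = ["length", "area", "volume", "telling time", "mass"]
print(range_index(item_hits, objectives))  # 0.6 (3 of 5 objectives hit)
```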
Surveys of Enacted Curriculum (SEC) Model. The SEC alignment approach analyzes standards, assessments, and instruction using a common content matrix, which has two dimensions for categorizing subject content: content topics and cognitive demands (Porter, 2002). Using this approach, content matrices for standards, assessments, and instruction are created, and the relationships between these matrices are examined. In addition to alignment statistics that can be calculated from the two-dimensional matrix, content maps and graphs can be produced to visually illustrate differences and similarities between standards, assessments, and instruction. In practice there are usually five or more content areas and six or more categories of cognitive demand upon which alignment is analyzed. To analyze assessments and standards, a panel of content experts conducts a content analysis and codes the assessment and/or standards by topic and cognitive demand. Results from the panel are then placed in a topic-by-cognitive-demand matrix, with values in the cells representing proportions of the overall content description. While expert judgment is used to collect information from academic standards and assessments for the two-dimensional matrices, teacher surveys are typically used to collect data on the content of instruction. Content of instruction is described at the intersection of topics and cognitive demand. Teachers are surveyed on the amount of time devoted to each topic and the relative emphasis given to student expectations. These survey data are then transformed into the proportion of total instructional time spent in each cell of the two-dimensional matrix.

Webb. Webb’s (1997, 1999) alignment model includes several indicators of alignment at the item and test level. Categorical concurrence is the consistency of
categories of content in the standards and assessments. The criterion of categorical concurrence between standards and assessment is met if the same or consistent categories of content appear in both the assessment and the standards. For example, if a content standard (or strand) is measurement in mathematics, does the assessment have items that target measurement? It is possible for an assessment item to align to more than one content standard. For example, if an assessment item requires students to calculate surface area, which is aligned to the content standard of measurement, to successfully answer the question the student also needs to be able to multiply numbers, which is aligned to the content standard of operations.
In this case the item is aligned to both content standards. The range-of-knowledge correspondence criterion examines the alignment of assessment items to the multiple objectives within the content standards. Range-of-knowledge correspondence is used to judge whether the span of knowledge expected of students by a standard is the same as, or corresponds to, the span of knowledge that students need in order to correctly answer assessment items. The range-of-knowledge numeric value is the percentage of content standards with at least 50% of their objectives having one or more hits. For example, if there are five objectives (e.g., length, area, volume, telling time, and mass) included in the content standard of measurement, a minimum expectation is that at least one assessment item is related to each of at least three of the objectives. The balance of
representation criterion is used to indicate the extent to which items are evenly