
A validity framework for the use and development of exported assessments

By María Elena Oliveri, René Lawless, and John W. Young

Contents

  • Component 1: Defining the Domain
  • Component 2: Evaluation
  • Component 3: Generalization
  • Component 4: Explanation
  • Component 5: Extrapolation
  • Component 6: Utilization


Defining the Domain

Defining the domain of interest is the first crucial step in the framework for new assessment development; it involves identifying the targeted constructs and the potential issues that may affect measurement accuracy. This process ensures that test takers’ performances yield valid evidence of the assessed constructs while minimizing sources of construct-irrelevant variance, as illustrated in Figure 2.

Figure 2. Toulmin diagram for domain definition.

The validity of assessments depends on ensuring that test items are construct-relevant and free from construct-irrelevant factors that can bias results. Tests with high cultural loads may inadvertently assess test takers' level of acculturation or cultural knowledge rather than the intended construct, threatening test fairness and validity. To maintain validity across diverse populations, test developers should minimize task demands unrelated to the construct by carefully reviewing and refining test items early in the development process. This approach helps ensure that inferences drawn from test scores accurately reflect the targeted construct and are valid across multiple groups. By considering different aspects of validity and potential sources of construct-irrelevant variance, we can enhance the interpretability and fairness of assessment scores for diverse populations.

We recommend using an evidence-centered design (ECD) approach to clearly define the key knowledge, skills, and abilities (KSAs) to be assessed. ECD helps address existing test design issues and maps out how test takers’ performances support inferences about the measured construct. It also aids in identifying potential sources of construct-irrelevant variance, especially when tests are administered to diverse populations. For example, introducing novel item types, such as complex multiple-choice or fill-in-the-blank questions, may cause differential performance due to factors unrelated to the construct being measured.

Linguistic differences, such as variations in idioms and dialects, can significantly affect test fairness even among native speakers of the same language. For example, English speakers from different countries may encounter unfamiliar expressions, such as the British idiom "on the blower," which is foreign to American examinees, or the American idiom "wet blanket," which may be unfamiliar to British test takers. These dialectal variations can disadvantage specific populations, highlighting the importance of considering regional language differences during test development to ensure fairness and equity for all test takers.

Various standards and guidelines emphasize the importance of accounting for linguistic and cultural differences in assessments. The ITC guidelines (2013) recommend that test developers ensure tests consider the diverse backgrounds of test takers, promoting fairness across multiple populations. Similarly, the Guidelines for the Assessment of English Language Learners (Pitoniak et al., 2009) advise using widely accessible vocabulary and avoiding colloquial, idiomatic, and polysemous words that could introduce construct-irrelevant variance. Additionally, specialized vocabulary should be evaluated to prevent regional or cultural biases, especially when referencing concepts such as region-specific sports like ice hockey, which may not be familiar worldwide.

Relying on geocultural contexts in test item development can pose challenges, especially when references are region-specific, such as U.S.- or British-centric locations unfamiliar to international test takers. For example, certain driving terms like "rotaries" may be well known in one region, while test takers elsewhere are more familiar with the term "circles." Additionally, references to specific monetary systems can create confusion for test takers from different countries, highlighting the importance of culturally neutral content to ensure fair assessment.

The idiom "on the blower" means to be on the telephone; a "wet blanket" is someone who dampens enthusiasm and tries to spoil others' enjoyment. Currencies likewise vary significantly across countries, which can complicate references to monetary amounts and transactions.

Relying on geocentric content can make exporting tests more challenging, especially if the material is too country-specific, such as references to the 50 U.S. states or Canadian provinces. This is why the ETS International Principles emphasize the importance of creating globally relevant content to facilitate international communication and assessment.

The ETS International Principles for Fairness Review of Assessments (ETS, 2009) emphasize the importance of adapting item context to the cultural background of the countries where assessments are administered. To ensure cultural fairness, conducting a comprehensive domain analysis with expert teams is recommended. This approach helps identify and modify assessment items to be culturally appropriate, promoting fairness across diverse populations.

Evaluating whether the items covering the assessed domain accurately represent the examined construct is essential for test validity. It is also important to ensure that these items are equally accessible to the diverse populations taking the assessment, addressing potential issues of fairness and inclusivity. The issue of accessibility in assessment design is discussed thoroughly in the joint standards of AERA, APA, and NCME (2014, pp. 52–53), which emphasize the need for equitable testing practices.

Expert-driven domain analysis is essential for ensuring that assessments contain only domain-specific information, thereby maintaining validity. Without expert involvement, claims, warrants, and grounds may be compromised, leading to weakened conclusions and unreliable test scores. Accurate domain analysis helps guarantee that assessment results reflect the target construct across diverse populations, enhancing the overall validity and usefulness of the evaluation.

Conducting expert reviews is a crucial step in ensuring that test items, instructions, and stimuli are free from construct-irrelevant information and therefore accurately assess the intended construct across diverse populations. These reviews, which can be conducted through group discussions or individual assessments, help identify linguistic or cultural biases that may disadvantage certain test takers. Specifically, experts evaluate whether the language used is unnecessarily complex, contextually unfamiliar, or unrelated to the construct, ensuring items are appropriate for all populations. The feedback from these reviews should be shared with test developers to eliminate sources of construct-irrelevant variance and enhance the fairness and validity of the assessment.

Expert review of assessments should be planned meticulously by selecting specialists well versed in both the original and target populations, including their language nuances and cultural contexts, to ensure fairness and validity. Preparing materials that highlight potentially problematic language helps reviewers identify and modify such items, ideally through collaborative review by at least two experts to prevent cultural or linguistic biases. The underlying assumption of test validity is that correct responses indicate true understanding, whereas incorrect answers should not be attributable to language or cultural differences. Issues may arise if the test language or dialect differs significantly from that of the test takers or their cultural background, underscoring the importance of careful adaptation and review processes.

To understand an individual's performance on a specific task, it is essential to consider the cultural influences shaping their behavior. Failing to account for culturally based behaviors can lead to mistaking normal differences for serious deficits, which underscores the importance of culturally sensitive assessment.

Contextual differences in test items can significantly affect response accuracy; for example, a math problem set in the context of curling may be unfamiliar and confusing to students in Caribbean countries but familiar to those in northern regions. Such unfamiliar contexts can lead to construct-irrelevant errors, not because students lack mathematical skills but because the context distracts their focus. Using culturally familiar contexts is essential to ensure that responses genuinely reflect students’ math abilities rather than their familiarity with the scenario. Therefore, interpreting test results requires considering potential construct-irrelevant variance caused by contextual differences in order to assess students' true skills accurately.

Expert reviewers also assess differences in the target population’s familiarity with specific item types, test-taking strategies, and the assessment structure, such as item sequencing by difficulty. To address potential unfamiliarity, providing comprehensive test information and practice materials can help individuals prepare and improve their performance.

Evaluation

The next step in the framework is to evaluate whether the test scores are plausible and appropriate for their intended purpose. Key considerations include whether the scoring rubrics accurately capture the construct of interest across diverse populations, enabling valid score-based inferences. It is essential to ensure that scoring criteria focus on the targeted construct rather than on language mechanics or unrelated response features. Scoring rubrics should also provide raters with clear guidance so that they assess the intended constructs fairly for all test takers. Developing scoring notes for constructed-response items is important to prevent raters from penalizing responses for irrelevant issues, such as grammar, when those issues are not part of the construct. Finally, the alignment of score use with the test's predetermined purpose must be assessed carefully to ensure valid interpretation of results.

To ensure test validity across diverse populations, scores must have comparable meanings; that is, claims about scores should hold regardless of the group. Evaluating the assumptions underlying score comparability is essential to prevent potential biases that could unfairly disadvantage certain test-taker populations. For instance, a scoring rubric emphasizing “succinct writing” might disadvantage individuals from cultures where directness is considered impolite, highlighting the importance of culturally sensitive assessment criteria.

Figure 3. Toulmin diagram for evaluation.

Evaluating the validity of tests administered to multiple populations involves four key steps: conducting field tests, analyzing field test data for differential item functioning (DIF), performing cognitive interviews, and consulting experts to assess the fairness of the rubric. DIF occurs when individuals with the same level of the underlying ability do not have an equal chance of answering a test item correctly, which may indicate potential bias caused by construct-irrelevant factors. Early detection of DIF is crucial, as it allows biased items to be reviewed and revised before the test is officially administered, ensuring fairness and accuracy in assessment outcomes.

Field testing is essential to determine whether a construct is equivalent across different groups and to confirm a shared understanding of the construct among test takers. Conducting field tests with the relevant populations supports the validity and reliability of the measurement. Several guidelines and best practices exist for carrying out field tests effectively, helping to refine assessment tools and enhance their accuracy across diverse groups.

According to Pitoniak et al. (2009), conducting small-scale pilot tests with a representative sample of test takers is essential when assessing English language learners. These pilot tests help evaluate the appropriateness of test items, ensure content accuracy and fairness, and determine optimal timing for different item types. Pilot testing also allows the clarity of test instructions to be assessed. Following field testing, interviews with participants can provide valuable insights that may not be apparent from data analysis alone, enhancing overall test validity and reliability.

Analyzing field test results is a crucial step in test development, especially when administering assessments to diverse populations. Field testing helps identify items influenced by cultural or linguistic differences, such as wording that may hinder understanding. These insights inform decisions to retain, modify, or discard test items, ensuring fairness and accuracy. Field test data also support the evaluation of how well items perform in linking different test forms through equating or linking procedures. DIF analyses, conducted on sufficiently large samples (approximately 300 examinees per group), help determine whether test items function equivalently across diverse populations, enhancing the test's fairness and validity.
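
As a concrete illustration of the DIF screening described above, the sketch below computes a Mantel-Haenszel common odds ratio and the ETS delta-scale statistic for one dichotomous item, matching examinees on total score. It is a minimal example on simulated data; the function and variable names are ours rather than the report's, and an operational analysis would use specialized DIF software with the sample sizes noted above.

```python
import numpy as np

def mantel_haenszel_dif(responses, group, item):
    """Mantel-Haenszel DIF check for one dichotomous item.

    responses : (n_examinees, n_items) array of 0/1 item scores
    group     : length-n array, 0 = reference group, 1 = focal group
    item      : index of the studied item
    Returns the MH common odds ratio and the ETS delta-scale statistic.
    """
    total = responses.sum(axis=1)                # matching criterion: total score
    num = den = 0.0
    for s in np.unique(total):                   # one 2x2 table per score stratum
        stratum = total == s
        ref, foc = stratum & (group == 0), stratum & (group == 1)
        a = responses[ref, item].sum()           # reference group, correct
        b = ref.sum() - a                        # reference group, incorrect
        c = responses[foc, item].sum()           # focal group, correct
        d = foc.sum() - c                        # focal group, incorrect
        n = stratum.sum()
        num += a * d / n
        den += b * c / n
    odds_ratio = num / den
    delta = -2.35 * np.log(odds_ratio)           # ETS MH D-DIF scale; |delta| >= 1.5 is
    return odds_ratio, delta                     # conventionally flagged for review

# Simulated field test: 20 items, 300 examinees per group, no true DIF built in
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 300)
ability = rng.normal(size=600)
difficulty = np.linspace(-1.5, 1.5, 20)
p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((600, 20)) < p_correct).astype(int)

print(mantel_haenszel_dif(responses, group, item=0))
```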

Cognitive interviews, also known as think-aloud studies, are essential for identifying hidden assumptions and alternative interpretations of test scores, enhancing our understanding of differential item functioning across diverse populations. They reveal differences in thought processes, uncover sources of confusion, and highlight how various populations interpret item types and formats, all of which can affect test performance. Cognitive interviews also assess whether all test takers use similar strategies when responding to items, providing valuable insight into cognitive processes. Conducted with a representative subset of test takers, these interviews offer a cost-effective method for collecting validity evidence and examining potential differences with small sample sizes, thereby improving test fairness and accuracy.

Evaluating the fairness of rubrics is a crucial step in establishing test validity, ensuring that scoring criteria, testing conditions, and statistical properties accurately reflect test takers’ knowledge of the assessed construct. Expert panels play a vital role by developing, reviewing, and calibrating scoring procedures to ensure consistency and fairness across diverse populations. This calibration is essential for enabling score comparability, which allows valid inferences about the targeted constructs across different contexts and tasks. Additionally, psychometricians conduct reliability studies to verify scoring consistency, further supporting the fairness and validity of the assessment results.
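
One simple form such a reliability study can take is a chance-corrected agreement index between two raters scoring the same responses. The sketch below computes Cohen's kappa on hypothetical rubric scores; the data and function names are illustrative only, and operational programs typically also examine exact and adjacent agreement rates and intraclass correlations.

```python
import numpy as np

def cohens_kappa(rater_a, rater_b, categories):
    """Chance-corrected agreement between two raters scoring the same responses."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    observed = np.mean(a == b)                              # raw agreement
    expected = sum(np.mean(a == c) * np.mean(b == c)        # agreement expected by chance
                   for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical scores on a 0-4 rubric assigned independently by two calibrated raters
rater_1 = [3, 2, 4, 1, 3, 2, 0, 4, 3, 2]
rater_2 = [3, 2, 3, 1, 3, 2, 1, 4, 3, 2]
print(round(cohens_kappa(rater_1, rater_2, categories=range(5)), 2))   # -> 0.73
```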

When using assessments with new populations, it is crucial to evaluate scoring rubrics to ensure they clearly define the key elements of correct responses, enabling valid scores (Kane, 2013). The appropriateness of the rubric must be assessed to prevent disadvantaging any test-taker group, ideally during early test review stages when new populations are introduced. This approach helps eliminate potential sources of construct-irrelevant variance. Additionally, the interpretation and use argument (IUA) should be evaluated by neutral or critical parties during expert review panels to ensure the fairness and validity of test content.

According to the ITC (2013) guidelines, test developers should provide evidence that test instructions and rubrics are appropriate for all intended test-taker populations. Effective scorer training involves using benchmark papers that reflect the characteristics of the various populations to ensure accurate application of the rubrics; including responses from different groups in training materials helps familiarize scorers with potential response variations. Regular recalibration at the start of each scoring session is essential to maintain scoring accuracy, supported by practices such as examining score discrepancies across populations and conducting back-reading checks by a chief reader or table leader. Using the same raters and rubrics across different populations can minimize score variance due to construct-irrelevant factors, although calibration challenges may arise when new raters are introduced. Supplementing rubrics with scoring notes that clarify cultural or linguistic nuances, such as expressive versus concise writing styles, can further enhance scoring fairness and validity.
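
The check on score discrepancies across populations mentioned above can be as simple as a standardized mean difference between groups scored with the same rubric and rater pool. The sketch below is a hypothetical illustration (names and data are ours); a large gap is not by itself evidence of bias, but it flags the score sets for back-reading and rubric review.

```python
import numpy as np

def standardized_score_gap(scores_a, scores_b):
    """Standardized mean difference between two populations' rubric scores."""
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Hypothetical essay scores (0-6 rubric) from two populations, same raters and rubric
population_1 = [4, 5, 3, 4, 4, 5, 3, 4]
population_2 = [3, 4, 3, 3, 4, 3, 2, 4]
print(round(standardized_score_gap(population_1, population_2), 2))   # -> 1.02
```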

Generalization

After thoroughly reviewing and redefining the assessment domain to suit the new population, it is essential to evaluate whether the test's rubrics and overall use remain appropriate. The next crucial step involves examining the reliability of student performance across parallel test forms, ensuring that, following the modifications made for export, the test items continue to represent a random and valid sample of the targeted domain, thereby maintaining assessment validity (Kane, 2013).

Evaluating the generalizability of assessment scores ensures that test results accurately reflect test takers’ knowledge, skills, and abilities across various testing conditions. This process involves verifying that the task configuration aligns with the intended score interpretations and includes a sufficient number of tasks to represent the construct comprehensively. Ultimately, scores should be consistent across different testing instances, such as different forms, sites, administrations, and raters, ensuring fair and reliable measurement of performance.
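
A basic empirical check on this kind of consistency is the correlation between scores that the same examinees earn on two parallel forms (alternate-forms reliability). The sketch below simulates such a check under assumed values; the variable names and error model are illustrative, and the same computation can be repeated within each population subgroup to see whether scores generalize equally well for every group.

```python
import numpy as np

# Simulated scores for 200 examinees taking two parallel forms of the same test
rng = np.random.default_rng(1)
true_ability = rng.normal(50, 10, size=200)            # assumed true-score scale
form_a = true_ability + rng.normal(0, 4, size=200)     # form-specific measurement error
form_b = true_ability + rng.normal(0, 4, size=200)

# Alternate-forms reliability: correlation between the two administrations
r_ab = np.corrcoef(form_a, form_b)[0, 1]
print(f"alternate-forms reliability ~ {r_ab:.2f}")
```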

Inadequate rater training can limit the comparability of test results across diverse populations, as raters less familiar with specific writing styles may score unfairly and assign lower scores to certain test takers. Ensuring consistent and reliable scoring is essential for maintaining test validity; this involves refining scoring rubrics and providing targeted rater training to address observed inconsistencies. During pretesting or item tryouts, it is crucial to sample conditions across multiple populations by including test takers from all relevant groups, thereby enhancing the generalizability of the assessment. Additionally, the item universe should be free of bias to ensure fair representation and accurate evaluation across diverse populations.

Figure 4. Toulmin diagram for generalization.

Explanation

During the explanation stage, the focus is on ensuring that variations in cognitive processes and KSAs relate only to differences in individuals' knowledge or skills on the assessed construct. A high test score indicates that the test taker possesses high levels of the KSAs aligned with the assessed construct, consistent with theoretical expectations. When a measure is free from linguistic or cultural biases and the scoring rubrics and mechanisms function effectively across different populations, it is reasonable to infer that new populations will demonstrate similar KSAs on the assessed construct. The Toulmin diagram illustrating this explanation is shown in Figure 5.

Evaluating the correlation between test scores and other measures is essential for validating assessment effectiveness. When using exported assessments, it is important to compare scores with localized measures of the same construct to ensure consistency. However, differences in test quality can affect correlations, so comparable quality between localized and exported assessments is needed for meaningful comparisons, such as between the EXADEP and local admissions tests in Spanish-speaking countries. This process involves examining how well EXADEP scores relate to assessments administered locally, either by local universities in decentralized education systems or through national exams in centralized systems. Additionally, interpreting low test scores requires caution, as factors such as how expertise is developed and demonstrated can affect the accuracy of inferences about test takers’ knowledge, skills, and abilities (KSAs).
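
Because differences in test quality (reliability) attenuate observed correlations, one common way to make such comparisons more interpretable is Spearman's correction for attenuation. The numbers below are hypothetical, not values reported for the EXADEP; the sketch only shows how the correction behaves when one of the two measures is less reliable.

```python
import numpy as np

def disattenuated_correlation(r_xy, rel_x, rel_y):
    """Correct an observed correlation for unreliability in both measures."""
    return r_xy / np.sqrt(rel_x * rel_y)

observed_r = 0.48            # hypothetical correlation: exported test vs. local measure
reliability_exported = 0.90  # hypothetical score reliabilities
reliability_local = 0.70     # a less reliable local measure pulls the observed r down

print(round(disattenuated_correlation(observed_r,
                                      reliability_exported,
                                      reliability_local), 2))   # -> 0.60
```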

According to Mislevy (2010), expertise is developed and demonstrated through the recognition of patterns in both the physical and social worlds. These patterns significantly influence how individuals learn, highlighting the importance of understanding diverse types of knowledge in skill development.

Figure 5. Toulmin diagram for explanation.

Understanding how we perceive, discuss, and use domain-specific tools influences assessment strategies and the expression of knowledge. Acquiring expertise involves leveraging natural strengths to address unfamiliar situations, a principle that extends to assessments and highlights the importance of familiar and culturally relevant task design. When assessment tasks are unfamiliar to certain populations, those test takers may struggle to demonstrate their knowledge, not because they lack understanding but because the testing format is unfamiliar. To address this, developing targeted test preparation materials that familiarize test takers with assessment requirements is essential. Employing a universal approach, using broadly applicable scenarios and minor adaptations such as culturally relevant units of measurement, can improve test takers' ability to showcase their skills. For instance, replacing miles with kilometers in math items removes unnecessary cognitive barriers, allowing individuals to focus on demonstrating their core knowledge and skills.

Extrapolation

Extrapolation is based on extending the interpretation of test scores to reflect a test taker’s broader proficiency across an entire domain, rather than just the specific test. Unlike generalization, which compares results across different test forms and populations, extrapolation involves applying scores to the underlying constructs and skills within the domain. This means that a test taker’s performance indicates their ability to perform any task testing the same constructs in the same domain, assuming their cognitive processes are consistent across tasks. Proper interpretation of scores allows a test taker’s knowledge, skills, and abilities (KSAs) to be inferred throughout the entire domain. However, using assessments that are not culturally or linguistically adapted for new populations can compromise validity, as tests developed for one context, such as U.S. higher education admissions, may not be appropriate for other regions, such as Asia, which may emphasize different skills or curricula.

Figure 6 highlights the importance of considering curricular differences when extrapolating test results across populations. It is essential to assess whether the tasks in the exported assessment reflect a representative sample of the domain's content. Determining whether parallel forms are equally representative ensures that test users can trust that score interpretations genuinely reflect test takers’ knowledge of the assessed domain.

When analyzing test scores, it is also essential to ensure that the assessment is not artificially speeded for test takers, which could compromise the validity of the results. Speededness occurs when time limits prevent a significant number of examinees from thoroughly engaging with all test items, so that the test may measure their speed rather than the intended ability. To maintain fairness and accuracy, test scores should primarily reflect genuine differences in knowledge, skills, and abilities related to the construct being assessed, rather than test-taking speed.
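
A straightforward screening for speededness is to compute, for each item position, the proportion of examinees who never reached it (trailing omits at the end of the form), and to compare these rates across populations. The sketch below is a minimal illustration on a toy response matrix; the function name and the use of NaN to mark not-reached items are our conventions, not the report's.

```python
import numpy as np

def not_reached_rates(responses):
    """Proportion of examinees who never reached each item position.

    responses : (n_examinees, n_items) array where np.nan marks an item the
                examinee never reached (trailing omits at the end of the form).
    """
    answered = ~np.isnan(responses)
    # an item counts as reached if it, or any later item, was answered
    reached = np.flip(np.flip(answered, axis=1).cumsum(axis=1), axis=1) > 0
    return 1 - reached.mean(axis=0)

# Toy matrix: the last two examinees run out of time near the end of the form
resp = np.full((5, 6), 1.0)
resp[3, 5] = np.nan          # examinee 3 never reaches the last item
resp[4, 4:] = np.nan         # examinee 4 never reaches the last two items
print(np.round(not_reached_rates(resp), 2))   # -> [0. 0. 0. 0. 0.2 0.4]
```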

Figure 6. Toulmin diagram for extrapolation.

Utilization

In a validity argument framework, utilization refers to how the scores from a test will be used.

Bachman (2005) emphasizes four key warrants for assessment utilization: first, that score interpretations are relevant to decision making and directly aligned with the targeted language constructs and assessed KSAs; second, that these interpretations are supported by a high probability of making correct decisions; third, that the outcomes of using the assessment are beneficial for test takers, test users, and society; and fourth, that the assessment provides sufficient information about test takers' skills to enable appropriate decision making. Utilization has significant implications when applying assessments to new populations, as clear evidence is essential to ensure validity and prevent ambiguities that could undermine the assessment's credibility in different contexts.

Figure 7 presents the Toulmin arguments for utilization, emphasizing that scores can generate valid inferences across multiple populations. The warrant asserts that score-based inferences are valid if the assessment results are used for their intended purpose. Backing statements confirm that the IUA is explicitly defined, ensuring that scores obtained from the assessment are appropriate for all targeted populations. Additionally, review processes and comparability studies have been conducted to verify that the assessment items are suitable for new populations and that the inferences drawn are consistent across all groups.

In the United States, for example, aspiring certified public accountants (CPAs) must pass the Uniform CPA Examination to obtain licensure, enabling them to work in diverse finance fields such as financial planning, accounting, and tax preparation. However, the applicability of this certification may be limited in other countries, such as the United Kingdom or France, where different licensing regulations and professional standards govern similar financial professions.

Figure 7. Toulmin diagram for utilization.

We developed this framework to address potential threats to validity when using exported assessments across different populations, emphasizing the importance of thorough validation procedures for high-stakes decisions. Key steps include expert review of the content domain, field testing, and psychometric analyses such as DIF and measurement invariance studies to ensure that parallel test forms produce comparable scores. It is also crucial to empirically evaluate the underlying assumptions of psychometric models, such as the uniform use of knowledge, skills, and abilities (KSAs), because these assumptions may be violated when assessments are administered to diverse groups. Additionally, creating test preparation materials with clear examples of all item types promotes equal access, helping test takers familiarize themselves with the content and format before testing and thereby enhancing fairness and validity.

Implementing all six components of the framework can be challenging, especially for tests developed before these guidelines existed, which may require gradual modification. The high costs of hiring experts for reanalysis and for creating or modifying items after pilot testing also pose significant limitations. Additionally, recruiting sufficient students from diverse populations for field testing is often difficult, which hampers the identification of cultural or linguistic biases affecting test validity. Despite these challenges, field testing remains crucial for verifying that assessments do not favor speed in certain populations and for detecting comprehension difficulties among different groups. Omitting field testing across diverse populations can lead to problems with score comparability and validity, raising concerns about whether test scores accurately reflect the intended construct across groups.

Cultural differences between the original and new populations can affect assessment effectiveness, as a one-size-fits-all approach is not suitable. Conducting cognitive interviews with a small sample can help identify culturally or pedagogically problematic items specific to the new population. We recommend evaluating these differences to strengthen the validity of the interpretation and use argument (IUA).

Our framework was inspired by the ITC guidelines (ITC, 2005, 2013) and the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014).

Test developers should play a crucial role in reviewing assessments intended for export, as they possess essential knowledge of the tested constructs. Collaboration with cultural experts helps identify culturally or linguistically loaded items before administration, enhancing test fairness. Engaging psychometricians prior to testing is vital for interpreting score differences, levels of invariance, and score-scale variations across populations revealed during field trials. Including researchers in the process ensures that findings from pretesting are validated, allowing population differences to be recognized and supporting the fair, valid use of exported assessments for accurate score-based inferences.

References

Altbach, P. G., Reisberg, L., & Rumbley, L. E. (2009). Trends in global higher education: Tracking an academic revolution. Report prepared for the UNESCO 2009 World Conference on Higher Education. Paris, France: UNESCO.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34. doi:10.1207/s15434311laq0201_1

Bracken, B. A., & Barona, A. (1991). State of the art procedures for translating, validating, and using psychoeducational tests in cross-cultural assessment. School Psychology International, 12, 119–132.

Brislin, R. W. (1980). Translation and content analysis of oral and written materials. In H. C. Triandis & J. W. Berry (Eds.), Handbook of cross-cultural psychology (pp. 389–444). Boston, MA: Allyn & Bacon.

Brislin, R. W. (1986). The wording and translation of research instruments. In W. J. Lonner & J. W. Berry (Eds.), Field methods in cross-cultural research (pp. 137–164). Thousand Oaks, CA: Sage.

Chapelle, C. A. (2008). The TOEFL validity argument. In C. A. Chapelle, M. K. Enright, & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 319–352). New York, NY: Routledge.

Chapelle, C. A., Enright, M. K., & Jamieson, J. (2010). Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice, 29(1), 3–13.

Clauser, B. E. (2000). Recurrent issues and recent advances in scoring performance assessments. Applied Psychological Measurement.

Educational Testing Service. (2009). ETS international principles for fairness review of assessments: A manual for developing locally appropriate fairness review guidelines in various countries. Princeton, NJ: Author.

Educational Testing Service. (2013). Boletín de Información e Instrucciones del examen EXADEP. Princeton, NJ: Author.

Educational Testing Service. (2015). ETS standards for quality and fairness. Princeton, NJ: Author.

Ercikan, K., Arim, R. G., Law, D. M., Lacroix, S., Gagnon, F., & Domene, J. F. (2010). Application of think-aloud protocols in examining sources of differential item functioning. Educational Measurement: Issues and Practice, 29(2), 24–35.

Ercikan, K., & Oliveri, M. E. (2013). Is fairness research doing justice? A modest proposal for an alternative validation approach in differential item functioning (DIF) investigations. In M. Chatterji (Ed.), Validity and test use: An international dialogue on educational assessment, accountability, and equity (pp. 69–86).

Geisinger, K. F. (1994). Cross-cultural normative assessment: Translation and adaptation issues influencing the normative interpretation of assessment instruments. Psychological Assessment, 6, 304–312.

Greenfield, P. M. (1997). You can't take it with you: Why ability assessments don't cross cultures. American Psychologist.

Hambleton, R. K. (1994). Guidelines for adapting educational and psychological tests: A progress report. European Journal of Psychological Assessment, 10, 229–244.

Hambleton, R. K., Merenda, P., & Spielberger, C. (Eds.). (2005). Issues, designs, and technical guidelines for adapting tests into multiple languages and cultures. Mahwah, NJ: Erlbaum.

Helms, J. E. (1997). The triple quandary of race, culture, and social class in standardized cognitive ability testing. In D. P. Flanagan, J. L. Genshaft, & P. L. Harrison (Eds.), Contemporary intellectual assessment (pp. 517–532). New York, NY: Guilford Press.

International Test Commission. (2005). International Test Commission guidelines for translating and adapting tests. Retrieved from http://www.intestcom.org/files/guideline_test_adaptation.pdf

International Test Commission. (2013). ITC guidelines on test use. Retrieved from http://www.intestcom.org/files/guideline_test_use.pdf

Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1–73.

Laing, S. P., & Kamhi, A. (2003). Alternative assessment of language and literacy in culturally and linguistically diverse populations. Language, Speech, and Hearing Services in Schools, 34, 44–55.

Mislevy, R. J. (2010). Some implications of expertise research for educational assessment. Research Papers in Education.

Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6–20. doi:10.1111/j.1745-3992.2006.00075.x

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (1999). On the roles of task model variables in assessment design. Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing.

Oliveri, M. E., Ercikan, K., & Zumbo, B. D. (2014). Effects of population heterogeneity on accuracy of DIF detection. Applied Measurement in Education, 27, 286–300.

Oliveri, M. E., Gundersen-Bryden, B., & Ercikan, K. (2012). Scoring issues in large-scale assessments. In M. Simon, K. Ercikan, & M. Rousseau (Eds.), Improving large-scale assessment in education: Theory, issues and practice (pp. 143–153). New York, NY: Routledge/Taylor & Francis.

Pitoniak, M. J., Young, J. W., Martiniello, M., King, T. C., Buteux, A., & Ginsburgh, M. (2009). Guidelines for the assessment of English language learners. Princeton, NJ: Educational Testing Service.

Rhodes, R. L., Ochoa, S. H., & Ortiz, S. O. (2005). Assessing culturally and linguistically diverse students: A practical guide. New York, NY: Guilford Press.

Toulmin, S. (1958). The uses of argument. Cambridge, UK: Cambridge University Press.

van de Vijver, F. J. R., & Hambleton, R. K. (1996). Translating tests: Some practical guidelines. European Psychologist, 1, 89–99.

van de Vijver, F. J. R., & Leung, K. (1997a). Methods and data analysis for cross-cultural research. Thousand Oaks, CA: Sage.

van de Vijver, F. J. R., & Leung, K. (1997b). Methods and data analysis of comparative research. In J. W. Berry, Y. H. Poortinga, & J. Pandey (Eds.), Handbook of cross-cultural psychology (Vol. 1, 2nd ed., pp. 257–300). Boston, MA: Allyn & Bacon.
