This dissertation was completed at the University of Languages and International Studies, Vietnam National University, Hanoi.
This dissertation was defended on 10 May 2018.
This dissertation can be found at:
- National Library of Vietnam
- Library and Information Center, Vietnam National University, Hanoi
DECLARATION OF AUTHORSHIP
I hereby certify that the thesis I am submitting is entirely my own original work except where otherwise indicated. I am aware of the University's regulations concerning plagiarism, including those regulations concerning disciplinary actions that may result from plagiarism. Any use of the works of any other author, in any form, is properly acknowledged at their point of use.
Date of submission:
Ph.D. Candidate's Signature:
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Prof. Nguyễn Hòa
(Supervisor)
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Prof. Fred Davidson
(Co-supervisor)
2.2 Standard setting for an English proficiency test
2.3 Common elements in standard setting
2.3.1 Selecting a standard-setting method
2.3.3 Preparing descriptions of performance-level descriptors
2.3.6 Compiling ratings and obtaining cut scores
2.4.2 Internal evidence
2.4.3 External evidence
2.4.3.1 Comparisons to other standard-setting methods
2.4.3.2 Comparisons to other sources of information
2.4.3.3 Reasonableness of cut scores
3 Testing listening
3.1 Communicative language testing
3.3 Listening construct
4 Statistical analysis for a language test
4.1 Statistical analysis of multiple choice (MC) items
4.3 Investigating reliability of a language test
5 Review of validation studies
5.1 Review of validation studies on standard setting
5.2 Review of studies employing argument-based approach in validating language tests
CHAPTER III: METHODOLOGY
1 Context of the study
1.1 About the VSTEP.3-5 test
1.1.1 The development history of the VSTEP.3-5 test
1.1.2 The administration of the VSTEP.3-5 test in Vietnam
1.1.3 Test takers
1.1.4 Test structure and scoring rubrics
1.1.5 The establishment of the cut scores
1.2 About the VSTEP.3-5 listening test
1.2.1 Test purpose
1.2.3 Performance standards
1.2.4 The establishment of the cut scores of the VSTEP.3-5 listening test
2 Building an interpretive argument for the VSTEP.3-5 listening test
3 Methodology
3.1 Research questions
3.2.1 Analysis of the test tasks and test items
3.2.2 Analysis of test reliability
3.3 Description of Bookmark standard-setting procedures
3.4.1 Test takers of early 2017 administration
3.4.2 Participants for Bookmark standard-setting method
3.5.3.2 Item map
1.1.3 Relationship between the input and response
1.2.1 Overall statistics of item difficulty and item discrimination
1.2.2 Item analysis
2 Analysis of the test reliability
3 Analysis of the cut scores
3.1 Procedural evidence
3.2 Internal evidence
3.3 External evidence
CHAPTER V: FINDINGS AND DISCUSSIONS
1 The characteristics of the test tasks and test items
2 The reliability of the VSTEP.3-5 listening test
3 The accuracy of the cut scores of the VSTEP 3-5 listening test
CHAPTER VI: CONCLUSION
1 Overview of the thesis
2 Contributions of the study
3 Limitations of the study
4 Implications of the study
5 Suggestions for further research
LIST OF THESIS-RELATED PUBLICATIONS
REFERENCES
APPENDIX 1: Structure of the VSTEP.3-5 test
APPENDIX 2: Summary of the directness and interactiveness of the questions of the VSTEP.3-5 listening test
APPENDIX 3: Consent form (workshops)
APPENDIX 4: Agenda for Bookmark standard-setting procedure
APPENDIX 5: Panelist recording form
APPENDIX 6: Evaluation form for standard-setting participants
APPENDIX 7: Control file for WINSTEPS
APPENDIX 8: Timeline of the VSTEP.3-5 test administration
APPENDIX 9: List of the VSTEP.3-5 developers
LIST OF FIGURES
Figure 2.1: Model of Toulmin's argument structure (1958, 2003)
Figure 2.2: Sources of variance in test scores (Bachman, 1990)
Figure 2.3: Overview of interpretive argument for ESL writing course placements
Figure 4.1: Item map of the VSTEP.3-5 listening test
Figure 4.2: Graph for item 2
Figure 4.3: Graph for item 3
Figure 4.4: Graph for item 6
Figure 4.5: Graph for item 13
Figure 4.6: Graph for item 14
Figure 4.7: Graph for item 15
Figure 4.8: Graph for item 19
Figure 4.9: Graph for item 20
Figure 4.10: Graph for item 28
Figure 4.11: Graph for item 34
Figure 4.12: Total score for the scored items
LIST OF TABLES
Table 2.1: Review of standard-setting methods (Hambleton & Pitoniak, 2006)
Table 2.2: Standard-setting evaluation elements (Cizek & Bunch, 2007)
Table 2.3: Common steps required for standard setting (Cizek & Bunch, 2007)
Table 2.4: A framework for defining listening task characteristics (Buck, 2001)
Table 2.5: Criteria for item selection and interpretation of item difficulty index
Table 2.6: Criteria for item selection and interpretation of item discrimination index
Table 2.7: General guideline for interpreting test reliability (Bachman, 2004)
Table 2.8: Number of proficiency levels & test reliability
Table 2.9: Summary of the warrant and assumptions associated with each inference in the TOEFL interpretive argument (Chapelle et al., 2008)
Table 3.1: Structure of the VSTEP.3-5 test
Table 3.2: The cut scores of the VSTEP.3-5 test
Table 3.3: Performance standard of Overall Listening Comprehension (CEFR: learning, teaching, assessment)
Table 3.4: Performance standard of Understanding conversation between native speakers (CEFR: learning, teaching, assessment)
Table 3.5: Performance standard of Listening as a member of a live audience (CEFR: learning, teaching, assessment)
Table 3.6: Performance standard of Listening to announcements and instructions (CEFR: learning, teaching, assessment)
Table 3.7: Performance standard of Listening to audio media and recordings (CEFR: learning, teaching, assessment)
Table 3.8: The cut scores of the VSTEP.3-5 test
Table 3.9: Criteria for item selection and interpretation of item difficulty index
Table 3.10: Criteria for item selection and interpretation of item discrimination index
Table 3.11: Number of proficiency levels & test reliability
Table 3.12: The venue for Angoff and Bookmark standard-setting methods
Table 3.13: Comparison between the Flesch-Kincaid readability analysis and the CEFR and IELTS grading systems
Table 3.14: Summary of the interpretive argument for the interpretation and use of the VSTEP.3-5 listening test
Table 4.1: General instruction of the VSTEP.3-5 listening test
Table 4.4: Instruction for Part 3
Table 4.5: Information provided in the specifications for the VSTEP.3-5 listening test
Table 4.7: Description of language levels for texts of items 1-8 in the specification
Table 4.9: Description of language levels for texts of items 9-20 in the specification
Table 4.10: Summary of the texts for items 21-35
Table 4.11: Description of language levels for texts of items 21-35 in the specification
Table 4.12: Summary of item discrimination and item difficulty
Table 4.13: Summary statistics for the flagged items
Table 4.14: Information for item 2
Table 4.16: Option statistics for item 2
Table 4.19: Item statistics for item 3
Table 4.21: Quantile plot data for item 3
Table 4.24: Option statistics for item 6
Table 4.26: Information for item 13
Table 4.29: Quantile plot data for item 13
Information for item 14
Item statistics for item 14
Option statistics for item 14
Quantile plot data for item 14
Information for item 15
Item statistics for item 15
Option statistics for item 15
Quantile plot data for item 15
Item statistics for item 19
Option statistics for item 19
Quantile plot data for item 19
Item statistics for item 20
Quantile plot data for item 20
Information for item 28
Item statistics for item 28
Option statistics for item 28
Quantile plot data for item 28
Information for item 34
Option statistics for item 34
Quantile plot data for item 34
Information for item 19
Information for item 20
Option statistics for item 20
Item statistics for item 34
The person reliability and item reliability of the test
Number of proficiency levels and test reliability
The test reliability of the VSTEP.3-5 listening test
Round 3 feedback for Bookmark standard-setting procedure
Summary of output from Round 3 of Bookmark standard-setting procedure
The cut scores set for the VSTEP.3-5 listening test by Bookmark method
The cut scores set for the VSTEP.3-5 listening test by Angoff method
Comparison between the results of the two standard-setting methods
LIST OF KEY TERMS
Construct: A construct refers to the knowledge, skill or ability that is being tested. In a more technical and specific sense, it refers to a hypothesized ability or mental trait which cannot necessarily be directly observed or measured, for example, listening ability. Language tests attempt to measure the different constructs which underlie language ability.
Cut score: A score that represents achievement of the criterion: the line between success and failure, mastery and non-mastery.
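The way cut scores partition a score scale into performance categories can be sketched in a few lines of code. The level labels and cut points below are invented for the illustration; they are not the operational VSTEP.3-5 values.

```python
# Hypothetical illustration of cut scores partitioning a score scale into
# performance levels. The labels and cut points are invented for the example.

def classify(score, cut_scores, labels):
    """Assign a performance label given ascending cut scores.

    A score below the first cut score falls in the lowest category;
    reaching a cut score places the examinee in the next category up.
    """
    level = 0
    for cut in cut_scores:
        if score >= cut:
            level += 1
    return labels[level]

labels = ["Below B1", "B1", "B2", "C1"]   # hypothetical categories
cuts = [4.0, 6.0, 8.5]                    # hypothetical cut scores on a 0-10 scale

print(classify(3.5, cuts, labels))  # Below B1
print(classify(6.0, cuts, labels))  # B2 (a score equal to a cut score clears it)
```

With three cut scores the scale is divided into four regions, which is exactly the classification role cut scores play in a proficiency test.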
Descriptor: A brief description accompanying a band on a rating scale, which summarizes the degree of proficiency or type of performance expected for a test taker to achieve that particular score.
Distractor: The incorrect options in multiple-choice items.
Expert panel: A group of target language experts or subject matter experts who
provide comments about a test
High-stakes test: A high-stakes test is any test used to make important decisions about test takers.
Inference: A conclusion that is drawn about something based on evidence and reasoning.
Input: Input material provided in a test task for the test taker to use in order to produce an appropriate response.
Interpretive argument: Statements that specify the interpretation and use of the test performances in terms of the inferences and assumptions used to get from a person's test performance to the conclusions and decisions based on the test results.
Item (also, test item): Each testing point in a test which is given a separate score or scores. Examples are: one gap in a cloze test, one multiple-choice question with three or four options, one sentence for grammatical transformation, one question to which a sentence-length response is expected.
Key: The correct option or response to a test item.
Multiple-choice item: A type of test item which consists of a question or incomplete sentence (stem), with a choice of answers or ways of completing the sentence (options). The test taker's task is to choose the correct option (key) from a set of possibilities. There may be any number of incorrect possibilities (distractors).
Options: The range of possibilities in a multiple-choice item or matching task from which the correct one (key) must be selected.
Panelist: A target language expert or subject matter expert who provides judgments informing the application of cut scores to examinee performance on a test.
Performance standard: The abstract conceptualization of the minimum level of performance distinguishing examinees who possess an acceptable level of knowledge, skill, or ability judged necessary to be assigned to a category, or for some other specific purpose, and those who do not possess that level. This term is sometimes used interchangeably with cut score.
Proficiency test: A test which measures how much of a language someone has learned. Proficiency tests are designed to measure the language ability of examinees regardless of how, when, why or under what circumstances they may have experienced the language.
Readability: Readability is the ease with which a reader can understand a written text. The readability of text depends on its content (the complexity of its vocabulary and syntax) and its presentation (such as typographic aspects like font size, line height, and line length).
Reliability: The reliability of a test is concerned with the consistency of scoring and the accuracy of the administration procedures of the test.
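Reliability in the internal-consistency sense is commonly estimated with Cronbach's alpha, which for dichotomously scored items reduces to KR-20. A minimal sketch, using an invented 0/1 response matrix (rows are test takers, columns are items):

```python
# Sketch of one common reliability estimate, Cronbach's alpha (equivalent to
# KR-20 for dichotomously scored items). The response matrix is invented.

def cronbach_alpha(responses):
    k = len(responses[0])        # number of items
    n = len(responses)           # number of test takers (unused beyond the data)

    def variance(xs):            # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(data), 3))
```

In practice such estimates come from dedicated tools (e.g. Rasch person/item reliability in WINSTEPS, as used later in this thesis); the formula above only illustrates the internal-consistency idea.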
Response probability (RP) criterion: In the context of Bookmark and similar item-mapping standard-setting procedures, the criterion used to operationalize participants' judgment regarding the probability of a correct response (for dichotomously scored items) or the probability of achieving a given score point or higher (for polytomously scored items). In practical applications, two RP criteria appear to be used most frequently (RP50 and RP67); other RP criteria have also been used, though considerably less frequently.
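Under the dichotomous Rasch model often used with the Bookmark method, the RP criterion maps each item difficulty to an ability location: solving P(correct) = RP gives θ = b + ln(RP / (1 − RP)). A sketch with hypothetical item difficulties (in logits), not values from the VSTEP.3-5 analysis:

```python
import math

# Sketch of how an RP criterion maps item difficulties to ability locations
# under the dichotomous Rasch model, P(correct) = 1 / (1 + exp(-(theta - b))).
# Solving P = RP for theta gives: theta = b + ln(RP / (1 - RP)).
# The item difficulties below are invented for illustration.

def rp_location(b, rp):
    """Ability at which an item of difficulty b is answered correctly
    with probability rp."""
    return b + math.log(rp / (1 - rp))

difficulties = [-1.2, -0.3, 0.4, 1.1]   # hypothetical Rasch difficulties (logits)

# With RP50 the mapped location equals the difficulty itself;
# RP67 shifts every item upward by about 0.7 logits.
for b in difficulties:
    print(b, round(rp_location(b, 0.50), 2), round(rp_location(b, 0.67), 2))
```

Because the shift is the same for every item, the choice of RP criterion does not reorder the items in the ordered item booklet, but it does move the ability locations, and hence the resulting cut scores.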
Rubric: A set of instructions or guidelines on an exam paper
Selected-response: An item format in which the test taker must choose the correct answer from alternatives provided.
Specifications (also, test specifications): A description of the characteristics of a test, including what is tested, how it is tested, and details such as number and length of forms, and item types used.
Standard setting: A measurement activity in which a procedure is applied to systematically gather and analyze human judgment for the purpose of deriving one or more cut scores for a test.
Standardized test: A standardized test is any form of test that (1) requires all test takers to answer the same questions, or a selection of questions from a common bank of questions, in the same way, and that (2) is scored in a "standard" or consistent manner, which makes it possible to compare the relative performance of individual students or groups of students.
Test form: Test forms refer to different versions of tests that are designed in the same format and used for different administrations.
Validation: An action of checking or proving the validity or accuracy of something. The validity of a test can only be established through a process of validation.
Validity: The degree to which a test measures what it is supposed to measure, or can be used successfully for the purpose for which it is intended. A number of different statistical procedures can be applied to a test to estimate its validity. Such procedures generally seek to determine what the test measures, and how well it does so.
ABSTRACT
Standard setting is an important phase in the development of an examination program, especially for a high-stakes test. Standard-setting studies are designed to identify reasonable cut scores and to provide backing for this choice of cut scores.
This study was aimed at investigating the validity of the cut scores established for a VSTEP.3-5 listening test administered in early 2017 to 1,562 test takers by one institution permitted by the Ministry of Education and Training, Vietnam to design and administer the VSTEP.3-5 tests. The study adopted the current argument-based validation approach with a focus on three main inferences constructing the validity argument: (1) test tasks and items, (2) test reliability, and (3) cut scores. The argument is that in order for the cut scores of the VSTEP.3-5 listening test to be valid, the test tasks and test items first needed to be designed in accordance with the characteristics specified in the specifications. Second, the listening test scores should be sufficiently reliable so as to reasonably reflect test takers' listening proficiency. Third, the cut scores should be reasonably established for the VSTEP.3-5 listening test.
In this study, both qualitative and quantitative methods were combined and structured to provide backing for and against the assumptions in each of these three inferences. With regard to the first and second inferences, an analysis of the test tasks and the test items was conducted, and test reliability was investigated in order to see whether it was in the acceptable range. In terms of the third inference, about the cut scores of the VSTEP.3-5 listening test, the Bookmark standard-setting method was implemented and the results were compared with those currently applied for the test.
This study offers contributions in three areas. First, this study supports the widely held notion that validity is a unitary concept and that validation is the process of building an interpretive argument and collecting evidence in support of that argument. Second, this study contributes towards raising awareness of the importance of evaluating the cut scores of high-stakes language tests in Vietnam so that fairness can be ensured for all test takers. Third, this study contributes to the construction of a systematic, transparent and defensible body of validity argument for the VSTEP.3-5 test in general and its listening component in particular. The results of this study are helpful in providing informative feedback on the establishment of the cut scores for the VSTEP.3-5 listening test, the test specifications, and the test development process. The positive results can provide evidence to strengthen the reasonableness of the cut scores, the specifications and the quality of the VSTEP.3-5 listening test. The negative results can give suggestions for changes or improvement in the cut scores, the specifications and the design of the VSTEP.3-5 listening test.
appreciate all his contributions of time, ideas and other assistance to make my Ph.D. experience productive and stimulating. His enthusiasm and encouragement were motivational for me, making my Ph.D. pursuit a short and enjoyable journey. I am also very grateful to him for involving me in his various research projects, which has provided me with a lot of experience in conducting this study. He has been a tremendous mentor.
I would also like to thank my co-supervisor, Fred Davidson, Professor Emeritus from the University of Illinois, for giving me the very first ideas, advice and guidance on how to start my Ph.D. study. His advice on both research and my career has been invaluable.
I am especially thankful to Professor Nathan T. Carr from California State University, Fullerton for conducting a series of workshops on designing and analyzing language tests at the University of Languages and International Studies, Vietnam National University - Hanoi. Being able to discuss my work with him has been invaluable for developing my ideas. Sharing his knowledge and experience about language testing and assessment in general and standard-setting methods in particular has been a great contribution to the completion of my Ph.D. thesis.
I want to thank all of my colleagues at the University of Languages and International Studies, Vietnam National University - Hanoi, especially my colleagues at the Center for Language Testing and Assessment, for sharing my workload and always cheering me up when I was down.
My sincere thanks also go to Dr. Huynh Anh Tuan, Dean of the Faculty of Post-graduate Studies, and his staff for helping me to process the paperwork and constantly reminding me of the deadlines. Without their support and encouragement, I would have postponed my thesis defense for one or two more years.
Words cannot express how grateful I am to my family. I want to say thank you to my parents and siblings for their encouragement during the time I conducted my study.
This thesis is dedicated to my beloved husband and my daughter for their love, endless support, encouragement and sacrifices throughout this experience.
As a final word, I would like to thank each and every individual who has been a source of support and encouragement and helped me to achieve my goal and complete my thesis work successfully.
CHAPTER I
INTRODUCTION
This chapter introduces the topic of the study and presents the main reasons for choosing it. After that, the chapter presents the questions that are going to be addressed within the scope of the study. A brief overview of the organization of the thesis will close the chapter.
1 Statement of the problem
The term "cut scores" refers to the lowest possible scores on a standardized test, high-stakes test or other forms of assessment that help to separate a test score scale into two or more regions, creating categories of performance or classification of examinees. Clearly, if the cut scores are not appropriately set, the results of the assessment could come into question. For this reason, establishing cut scores for a test has been considered an important and practical aspect of standard setting. In Kane's (2006) recent discussion of test validation, besides emphasizing the importance of carefully defining the selected cut scores, he highlights the evaluation of the reasonableness of the cut scores and states that the establishment of the cut scores is a complex endeavor, but the validation of the cut scores is even more difficult.
According to the Standards for Educational and Psychological Testing (AERA et al., 1999, p. 9), validity is defined as "the degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests", and test validation is the process of making a case for the proposed interpretation and use of test scores. This case takes the form of an argument that states a series of propositions supporting the proposed interpretation and use of test scores and summarizes the evidence supporting these propositions (Kane, 2006). With regard to standard setting, since there are no "gold standards" and "true cut scores", to validate established cut scores means to provide evidence in support of the plausibility and appropriateness of the proposed cut score interpretations, their credibility and defensibility (Kane et al., 1999). Around the world, though plenty of studies have been conducted on the validity of cut scores established for a test, these studies mainly aim at cross-validating two different methods of standard setting and comparing the results of these methods instead of investigating the validity of cut scores as a whole.
In Vietnam, the National Foreign Language 2020 Project (NFL2020) was initiated in 2008 with the aim to "renovate the teaching and learning of foreign languages within the national education system" so that "... by 2020, most Vietnamese students graduating from secondary, vocational schools, colleges and universities will be able to use a foreign language confidently in their daily communication, their study and work in an integrated, multi-cultural and multi-lingual environment, making foreign languages a comparative advantage of development for Vietnamese people in the cause of industrialization and modernization for the country" (Decision 1400/QD-TTg). Language assessment is considered a major component of this project. The biggest achievement of this component is the emergence of the first ever standardized test of English proficiency (the VSTEP.3-5 test). The test was officially released by the Ministry of Education and Training, Vietnam on 11 March 2015. The test aims at measuring English ability across a broad language proficiency continuum from level 3 to level 5, which is equivalent to B1-C1 CEFR levels (Common European Framework of Reference for Languages). The cut scores of the VSTEP.3-5 test help to categorize test takers and certify them based on the levels they achieve. These cut scores are applied to all of the results of the VSTEP.3-5 tests, which are supposed to be strictly built in accordance with the test specifications.
At the moment, the results and certificates of the VSTEP.3-5 test are used by many companies as the requirement for a job position and by many educational institutions as a "visa" for learners to be accepted into or graduate from an academic program. For example, English teachers from primary schools and secondary schools throughout Vietnam are expected to obtain level 4 in English (equivalent to B2) while the requirement for those working in high schools, colleges and universities is level 5 (equivalent to C1). Besides, in order to graduate from universities, English-major students need to show evidence of their English at level 5 (equivalent to C1) and that for non-English-major students is level 3 (equivalent to B1). This shows that the uses of the VSTEP.3-5 test and the decisions that are made from the test cut scores have important consequences for the stakeholders. Like other high-stakes tests such as TOEFL, IELTS, PTE, or Cambridge tests, in order to gain credibility and defensibility, more research needs to be conducted on the test in general and the validity of the VSTEP.3-5 cut scores in particular. However, so far, there have been few studies on the VSTEP.3-5 test and there is no validation research on the cut scores of the test.
Among the skills tested in high-stakes examinations, listening is the skill that the fewest researchers choose to study. According to Buck (2001), the assessment of listening ability is one of the least understood and least developed areas of language testing and assessment. However, Buck (2001) also states that the assessment of listening ability is one of the most important testing aspects. In terms of standard setting and cut score validation, the procedure for listening tests is also much more complicated and time-consuming. However, for the author of this study, listening is a skill that is really interesting and thus worth exploring.
All of the reasons mentioned above have intrigued the author of this doctoral thesis to conduct a validation study on the cut scores of the VSTEP.3-5 listening test by using the validity argument-based model proposed by Kane (2013). A validity argument is a set of related propositions that, taken together, form an argument in support of an intended use or interpretation of the test scores. With the deeply rooted desire to develop a good proficiency listening test in Vietnam, this research is expected to bring the author of this doctoral thesis a profound insight into this specific area of interest for her future professional development.
2 Objectives of the study
As mentioned, since the VSTEP.3-5 test is a newly developed high-stakes test, the need to standardize it is imperative. Thus, this doctoral research is conducted as an ongoing attempt at building a systematic, transparent and defensible body of validity argument for the VSTEP.3-5 test in general and its listening component in particular. By adopting the argument-based approach recommended by Kane (2013), the study aims at investigating the validity of the cut scores of the VSTEP.3-5 listening test.
3 Significance of the study
This study will be a significant endeavor in building a systematic, transparent and defensible body of validity argumentation for the VSTEP.3-5 test in general and its listening component in particular. This study will also contribute to the practice of validating the cut scores of a test by adopting the argument-based approach. Moreover, the results of this study will be helpful in providing a close look at the test specifications of the VSTEP.3-5 listening test, the test development process, and the establishment of the cut scores of the test. These results can provide evidence to either support the reasonableness of the test specifications of the VSTEP.3-5 listening test, the test development process and the establishment of the cut scores of the test, or suggest adjustments to them.
4 Scope of the study
In the current context of English testing and assessment in Vietnam, the cut scores of the VSTEP.3-5 listening test are pre-established and applied to all of the test forms, which are supposed to be strictly designed in accordance with the specifications. Thus, when a VSTEP.3-5 listening test is delivered by any authorized institution, it is supposed to have been constructed based on the specifications so that the cut scores can be interpreted in the preset way. Within the scope of this study, the focus is on the validation of the cut scores of the VSTEP.3-5 listening test administered in early 2017 by one institution permitted by the Ministry of Education and Training, Vietnam (MOET) to design and administer the VSTEP.3-5 test (hereinafter referred to as the VSTEP.3-5 listening test). The argument is that in order for the cut scores of the VSTEP.3-5 listening test to be valid, (1) the test tasks and test items are designed in accordance with the test specifications; (2) the test scores are reliable in measuring test takers' proficiency; and (3) the cut scores are reasonably established so that they are useful for making decisions about test takers' English listening competency. Thus, the three aspects of the test that will be taken into consideration in this study include: (1) the design of test tasks and test items, (2) the test reliability, and (3) the accuracy of the cut scores.
5 Statement of research questions
Based on the interpretive argument for the validity of the cut scores of the VSTEP.3-5 listening test, there is one main research question for the study, which is then clarified by three sub-questions.
The main research question is:
To what extent do the cut scores of the VSTEP.3-5 listening test provide reasonable interpretation of the test takers' listening ability?
The three sub-questions that help to clarify the main research question are:
1. To what extent are the test tasks and the test items of the VSTEP.3-5 listening test properly designed in accordance with the specifications?
2. To what extent are the VSTEP.3-5 listening test scores reliable in measuring the test takers' English proficiency?
3. To what extent are the cut scores reasonably established for the VSTEP.3-5 listening test?
6 Organization of the study
The study consists of six chapters as follows:
Chapter I: Introduction
Chapter II: Literature Review
Chapter III: Methodology
Chapter IV: Data Analysis
Chapter V: Findings and Discussions
Chapter VI: Conclusion
Chapter I is aimed at introducing the topic of the study and presenting the main reasons for the author to implement this project.
Chapter II is to provide a profound theoretical and empirical background with a critical discussion on the relevant concepts, models, or theories for the study.
Chapter III presents the context of the study and how the study is conducted, together with a review of each selected method.
Chapter IV presents the data analysis of the study.
Chapter V presents the findings of the study and discusses these results.
Chapter VI has two aims. First, it specifies the limitations of the study. Second, it suggests some directions for future studies.
CHAPTER II
LITERATURE REVIEW
This chapter reviews theories and research which are fundamental to the current study. The first part of this chapter starts with a presentation of how the concept of validity has changed over the years. Then, it discusses the validation approaches and procedures in articulating a validation argument before describing different kinds of evidence that can be collected in support of a validation argument. The second part of this chapter focuses on standard setting, including definitions of the concept, an overview of different standard-setting methods, a discussion of common elements in standard setting, and issues related to standard-setting validation. The third part of this chapter first addresses the issues about testing listening and then presents the framework for analyzing the listening test tasks. The fourth part of this chapter describes the important statistical analyses for a language test, including item difficulty, item discrimination and test reliability. Finally, the review of related validation studies ends the chapter.
1 Validation in language testing
1.1 The evolution of the concept of validity
Validity is considered one of the most important concepts in psychometrics, but as Sireci (2009) states, validity has taken on many different meanings over the years. In the early 20th century, validity was primarily defined in terms of the correlation of test scores with some other criterion, and tests were described as valid for anything they correlated with (Kelley, 1927; Thurstone, 1932; Bingham, 1937). Another theoretical definition of validity was proposed by Garrett (1937), who defined validity simply as the degree to which "a test measures what it is supposed to measure" (Garrett, 1937, p. 324). However, this definition has long been criticized: it states a necessary requirement, but one that is insufficient by itself to validate a test.
By the middle of the 20th century, Rulon (1946) was one of many psychometricians (Ebel, 1956; Gulliksen, 1950; Mosier, 1947; Cureton, 1951) calling for a more comprehensive treatment of validity theory and test validation. In 1954, the Technical Recommendations (APA, 1954) proposed four types of validity, namely predictive validity, concurrent validity, content validity, and construct validity. In contrast, Anastasi (1954) organized the presentation of validity in terms of face validity, content validity, factorial validity, and empirical validity before adopting the framework promulgated in the 1954 Standards. The framework was later presented in terms of content validity, criterion-related validity, and construct validity (Anastasi, 1982; Cronbach, 1955, 1960). In this framework, predictive validity and concurrent validity were seen as criterion-oriented validity, more space was allocated to construct validity, and content validity was mentioned in an additional extended discussion of proficiency tests and construct validity.
In the Standards for Educational and Psychological Testing (AERA et al., 1985), validity is defined as follows: "The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. A variety of inferences may be made from scores produced by a given test, and there are many ways of accumulating evidence to support any particular inference. Validity, however, is a unitary concept" (APA, 1985, p. 9). It was clearly stated that "the inferences regarding specific uses of a test are validated, not the test itself" (APA, 1985, p. 9).
The definition in the 1985 Standards was well explained and elaborated by Bachman (1990). According to Bachman (1990), "the process of validation starts with the inferences that are drawn and the uses that are made of scores. These uses and inferences dictate the kinds of evidence and logical argument that are required to support judgments regarding validity. Judging the extent to which an interpretation or use of a given test score is valid thus requires the collection of evidence supporting the relationship between the test score and an interpretation or use" (Bachman, 1990, p. 243). Messick (1989) defines validity as a unitary concept and describes validity "as an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores" (p. 13). This theoretical conceptualization of validity as a unitary concept has been widely endorsed by the measurement profession as a whole (Shepard, 1993; AERA et al., 1999; Messick, 1989; Kane, 1992, 2006, 2009, 2013), and it has a strong influence on current practice in educational and psychological testing in many parts of the world. In this study, this conceptualization of validity is adopted as the theoretical foundation for understanding different aspects of validity.
1.2 Aspects of validity
There are several aspects of this conceptualization of validity that need to be borne in mind. First listed by Linn and Gronlund (2000) and then elaborated by Kane (2009, 2013), five aspects of validity can be summarized as follows. First, it is the interpretations and uses of test scores that are validated, not the tests themselves. It would be more correct to speak of the validity of the uses we make of test scores, or to speak of test scores as valid indicators or measures of a particular ability. However, it can be quite reasonable to talk about the validity of a test if an interpretation or use has already been adopted, explicitly or implicitly.
Second, validity is a matter of degree, and it may change over time as the interpretation/use develops and as new evidence accumulates. Since a particular test score can never provide a perfectly accurate measure of a given ability, and the validity of an interpretation and use always depends on the logic of the interpretive argument and the strength of the evidence provided in support of this argument, we can never prove that our interpretation and use are valid. Thus, it is better to provide evidence that the intended interpretation and use are more plausible than other interpretations that might be offered.
Third, validity is always specific to a particular use or interpretation. When a test is developed, we always have a particular set of interpretations and uses in mind. These intended interpretations will depend on how the construct or ability to be measured is defined. The way we define a given ability may differ from the way it might be defined for different purposes or for different groups of test-takers. Thus, the scores of a particular test will not necessarily be appropriate for other situations or other purposes.
Fourth, validity is a unitary concept. We often hear about different types of validity, such as content validity, concurrent validity, predictive validity, or construct validity. However, as defined above, validity is a single quality of the ways in which we use a particular test. Many different kinds of evidence, such as test content analysis or correlations with other measures of the ability, can be provided in support of the intended interpretation and use. For this reason, the evidence needed for validation depends on the interpretation and use, and different interpretations or uses will require different kinds and different amounts of evidence for their validation.
Fifth, validity involves an overall evaluative judgment. A validation argument typically includes several parts and is supported by different kinds of evidence, none of which by itself is sufficient to justify the intended inferences and uses of a particular test. Thus, when evaluating the validity of inferences and uses, it is important to consider the interpretive argument in its entirety, as well as all the supporting evidence.
In summary, given the aspects of validity described above, investigating the validity of test use, which is called validation, can be seen as the process of building an interpretive argument and collecting evidence in support of that argument (Kane, 1992, 2006, 2013). This validation approach was termed the argument-based approach to validation by Kane (1992) and is widely applied in validation practice today.
1.3 Argument-based approach to validation
Kane (1992, 2006, 2013) recommends that researchers follow an argument-based approach to make the task of validating inferences from test scores both scientifically sound and manageable. In this approach, the validator builds an argument that focuses on defending the interpretation and use of test scores for a particular purpose and is based on empirical evidence to support that particular use. According to Kane (2006), validation consists of two types of arguments: an interpretive argument and a validity argument. The interpretive argument is built upon a number of inferences and assumptions that are meant to justify score interpretation and use, whereas the validity argument provides an evaluation of the interpretive argument in terms of how reasonable and coherent it is, as well as how plausible its assumptions are (Cronbach, 1988). To be more specific, the establishment of an interpretive argument involves (1) the determination of the inferences based on test scores, (2) the articulation of assumptions, (3) the decision on the sources of evidence that can support or refute those inferences, (4) the collection of appropriate data, and (5) the analysis of the evidence. In order for the interpretation or use of test scores to be valid, all of the inferences and assumptions inherent in the interpretation or use have to be plausible.
Kane (1992, 2006, 2013) cites Toulmin's (1958, 2003) framework as a guide for applying his approach to validation. Toulmin's (1958) framework essentially requires that a chain of reasoning be established that builds a case towards a conclusion, which in this case is to determine the plausibility and reasonableness of score interpretation and use. Figure 2.1 shows Toulmin's (1958, 2003) argument structure, which is built on several components, including the grounds, claim, warrant, backing, and rebuttal.
[Figure 2.1: Model of Toulmin's argument structure (1958, 2003), showing grounds, claim, warrants, backing, and rebuttal]
In terms of test score interpretation and use, the claim of an argument is the conclusion drawn about a test-taker based on test performance, whereas the grounds serve as the data or observations upon which the claim is based. An example can be illustrated by the case given by Mislevy et al. (2003). One may make the claim that a student's English speaking abilities are inadequate for studying at an English-medium university based on the grounds/observation that the student did not perform well in the final oral examination in an intensive English class. The performance that the teacher observed when the student spoke in front of the class on the assigned topic was characterized by hesitations and mispronunciations.
In the interpretive argument underlying the claim about the student's readiness for university study, the inference linking the grounds to the claim is not given, and therefore justification is needed in the form of a warrant (or assumption). Figure 2.1 shows the interpretive argument as consisting of one inference, which is authorized by a warrant. The warrant in Toulmin's model is regarded as a rule, principle, or established procedure that is meant to provide justification for the inference connecting the grounds to the claim. In the example provided above, the warrant is the generally held principle that hesitations and mispronunciations are characteristics of students with low levels of English speaking ability, who would have difficulty at an English-medium university.
The warrants in turn need backing (or evidence), which comes in the form of theories, research, data, and experience. In relation to the example provided above, backing might be drawn from the teacher's training and previous experience with non-native speakers at an English-medium university. Stronger backing could be obtained by having the student's speaking performance rated by another teacher and then showing the agreement between the two raters.
Finally, a rebuttal also acts as a link between the grounds and the claim, but serves to weaken the initial argument by providing evidence or a possible explanation that may call the warrant into question. Going back to the previous example, a possible rebuttal may be that the assigned topic for the oral presentation required the student to use highly technical and unfamiliar vocabulary. This rebuttal would serve to weaken the inference connecting the grounds (that the oral presentation contained many hesitations and mispronunciations) and the claim that the student's speaking ability was at a level that would not allow him to succeed at an English-medium university.
To validate the proposed interpretation and use of test scores, it is necessary to evaluate the clarity, coherence, and completeness of the interpretive argument and to evaluate the plausibility of each of the inferences and assumptions in the interpretive argument. The proposed interpretation and use of the test scores should be explicitly stated in terms of a network of inferences and assumptions, and then the plausibility of the warrants for these inferences can be evaluated using relevant evidence. The validation can be simple if the interpretation and use are simple and limited, involving a few plausible inferences. In contrast, if the interpretation and use are ambitious, involving a more extensive network of inferences, the validation will require more extensive evidence.
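As an informal illustration only (this sketch is not part of Toulmin's or Kane's own formalism), the components of a Toulmin-style argument described above can be represented as a small data structure. All names, the example strings, and the crude `is_supported` heuristic are the present author's hypothetical choices:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ToulminArgument:
    """Minimal sketch of Toulmin's (1958) argument components."""
    grounds: str   # data/observations the claim rests on
    claim: str     # conclusion drawn about the test-taker
    warrant: str   # rule licensing the grounds-to-claim inference
    backing: List[str] = field(default_factory=list)    # evidence for the warrant
    rebuttals: List[str] = field(default_factory=list)  # conditions weakening it

    def is_supported(self) -> bool:
        # Crude heuristic: the inference is tentatively supported when some
        # backing exists and no rebuttal has been raised. A real validity
        # argument weighs evidence rather than counting it.
        return bool(self.backing) and not self.rebuttals

argument = ToulminArgument(
    grounds="Hesitations and mispronunciations in the final oral exam",
    claim="Speaking ability is inadequate for an English-medium university",
    warrant="Hesitations and mispronunciations indicate low speaking ability",
    backing=["Teacher's rating", "Agreement with a second rater"],
)
print(argument.is_supported())  # True: backing present, no rebuttal yet
argument.rebuttals.append("Topic required unfamiliar technical vocabulary")
print(argument.is_supported())  # False once a rebuttal is raised
```

The point of the sketch is only that a rebuttal weakens an otherwise backed inference, mirroring the oral-presentation example above.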
There are some advantages of this approach, summarized by Kane (1990, 2006). First, the argument-based approach can be applied to any type of test interpretation or use. It does not discourage the development of any kind of interpretation. It does not preclude the use of any kind of data collection technique in developing a measurement procedure. It does not identify any kind of validity evidence as being generally preferable to any other kind. It suggests that the interpretation be stated as clearly as possible and that the validity evidence be consistent with the interpretation.
Second, the argument-based approach to validation provides definite guidelines for systematically evaluating the validity of proposed interpretations and uses of test scores. In other words, this approach provides the researcher with a clear place to begin and a direction to follow, and as a result it helps the researcher to focus serious attention on validation.
Third, in the argument-based approach, the evaluation of the interpretive argument does not lead to any absolute decision about validity, but it does provide a way to measure progress (Kane, 2006). In other words, this approach views validation as an ongoing and critical process rather than a matter of simply answering "valid" or "invalid". Because the most problematic inferences and their supporting assumptions are checked and are either supported by the evidence or at least made less problematic, the reasonableness of the interpretive argument as a whole can improve.
Fourth, the argument-based approach to validation may increase the chance that research on validity will lead to improvements in measurement procedures. Since the argument-based approach focuses attention on specific parts of the interpretive argument and on specific aspects of measurement procedures, evidence which indicates the existence of a problem, such as inadequate coverage of content or the presence of systematic errors, helps to suggest ways to solve the problem and thereby to improve the procedure.
As stated above, applying the argument-based approach to validation allows a researcher to work on any type of interpretation or use of test scores, as long as the interpretation is stated as clearly as possible and the validity evidence is consistent with the interpretation. With regard to the interpretation and use of the cut scores of a test, before stating the inferences and assumptions that help to structure the interpretive argument, it is necessary to consider the important aspects of standard setting and the elements needed for applying pre-established cut scores to a listening proficiency test. The following part presents the issues related to standard setting and the framework for analysing a listening test.
2 Standard setting for an English proficiency test
2.1 Definition of standard setting
Probably the most difficult and controversial part of testing is standard setting. In this study, standard setting refers to the establishment of cut scores for a test, i.e., determining the points on the score scale that separate examinees into performance categories such as pass/fail. However, in order to have a complete understanding of the concept, it is important to become familiar with the theoretical foundations of the term.
Cizek (1993) suggests an elaborate and theoretically grounded definition of standard setting. He defines standard setting as "the proper following of a prescribed, rational system of rules or procedures resulting in the assignment of a number to differentiate between two or more states or degrees of performance" (Cizek, 1993, p. 10). This definition highlights the procedural aspect of standard setting and draws on the legal framework of due process and traditional definitions of measurement. However, the definition suggested by Cizek (1993) suffers from at least one deficiency in that it addresses only one aspect of the legal principle known as due process, that is, a process that is clearly articulated in advance, is applied uniformly, and includes an avenue for appeal.
In contrast to the procedural aspect of due process is the substantive aspect, which centers on the results of the procedure. Substantive due process demands that the procedure lead to a decision or result that is fundamentally fair. However, the notion of fairness is, to some extent, subjective. The aspect of fundamental fairness is related to what has been called the "consequential basis of test use" in Messick's (1989, p. 84) explication of the various sources of evidence in support of the use and interpretation of test scores.
Kane (1994) provides another definition of standard setting that highlights the conceptual nature of the endeavor. He states that "it is useful to draw a distinction between the passing score, defined as a point on the score scale, and the performance standard, defined as the minimally adequate level of performance for some purpose. The performance standard is the conceptual version of the desired level of competence, and the passing score is the operational version" (Kane, 1994, p. 426). For this reason, as stated, this study uses the terms "standard setting" and "cut scores" interchangeably.
Another key concept underlying Kane's definition of standard setting is the concept of inference. As discussed in the previous part, an inference is the interpretation, conclusion, or meaning that one intends to draw about an examinee's underlying, unobserved level of knowledge, skill, or ability. From this perspective, validity refers to the accuracy of the inferences made about the examinee, usually based on observations of the examinee's performance. The assumption underlying the inference of standard setting is that the passing score creates meaningful categories that distinguish between individuals who meet some performance standard and those who do not.
Thus, for this study, the primacy of test purpose and the intended inference or test score interpretation is essential to understanding the definition of standard setting. To validate the standard setting, or the establishment of the cut scores, is to evaluate the accuracy of the inferences that are made when examinees are classified based on the application of the cut scores.
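The classification step described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the cut scores and the CEFR-style level labels below are invented for illustration and are not the operational VSTEP.3-5 values.

```python
# Hypothetical cut scores on a 0-10 listening score scale, ordered from the
# highest cut downward. Real cut scores come from a standard-setting study.
CUT_SCORES = [(8.5, "C1"), (6.0, "B2"), (4.0, "B1")]

def classify(score: float) -> str:
    """Map a raw score to a performance level using pre-established cut scores."""
    for cut, level in CUT_SCORES:
        if score >= cut:
            return level
    return "Below B1"

print(classify(7.25))  # -> B2
```

Validating the standard setting then amounts to asking how accurately such classifications reflect examinees' actual ability levels.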
Finally, in wrapping up the definition of standard setting, it is important to note what standard setting is not. According to Cizek and Bunch (2007, p. 18), standard setting does not seek to find "true" cut scores that separate real, unique categories on a continuum of competence. Since there is no external truth for all the things we care about, and no set of minimum competences that are necessary and sufficient for life success, all standard setting is judgmental. Cizek and Bunch (2007) also state that, because standard setting necessarily involves human opinions and values, it can to some degree be seen as a combination of technical, psychometric methods and policy making. Seen in this way, standard setting can be regarded as a procedure that enables participants to bring their judgments, by using a specified method, in such a way as to translate the policy positions of authorizing entities into locations on a score scale.
2.2 Overview of standard-setting methods
In recent years, more attention has been paid to establishing the credibility of existing standard-setting methods and to investigating new methods. Many of the new approaches have been developed to present judges with meaningful activities and to better accommodate the changing nature of assessments. Cizek and Bunch (2007) summarize 18 different methods of cut-score establishment. However, one common element of all these methods is that they involve, to one degree or another, human beings expressing informed judgments based on the best evidence available to them, and these judgments are summarized in some systematic way, typically with the aid of a mathematical model, to produce one or more cut scores. Cizek and Bunch (2007) state that each standard-setting method combines art and science, and different methods may yield different results. For this reason, it is hard to say that the cut scores established by one particular method are always superior to those set by another method. However, it can be stated that some methods are better suited to certain types of tests or circumstances.
Hambleton and Pitoniak (2006) classify standard-setting methods into four categories: (1) methods that involve review of test items and scoring rubrics, (2) methods that involve review of examinees, (3) methods that involve looking at examinee work, and (4) methods that involve panelist review of score profiles.
Table 2.1 summarizes standard-setting methods, organized by type of rating.

Table 2.1: Standard-setting methods, organized by type of rating

Methods involving ratings of test items
- Angoff: Panelists estimate the probability that the borderline examinee will answer each multiple-choice item correctly.
- Extended Angoff and related methods: Similar to the Angoff method for multiple-choice items, except that the panelists rate polytomous items (or, with the analytic method, items within a simulation) and indicate the number of score points (or the average score of 100 borderline examinees) that the borderline examinee will obtain.
- Ebel: Panelists rate each item along two dimensions, difficulty and relevance. For each combination of the dimensions, panelists estimate the percentage of items that the borderline examinee will answer correctly.
- Nedelsky: Panelists indicate which distractors they think the borderline examinee would be able to rule out as incorrect; the reciprocal of the number of distractors not ruled out, plus one (for the correct answer), is taken as the probability that the borderline examinee will answer the item correctly by random selection among the remaining answer choices.
- Yes/No: Panelists make a yes/no rating for each item in terms of whether every examinee (not just the borderline examinee) should be able to answer the item correctly.
- Bookmark: Panelists review test items that are ordered according to their difficulty parameter and place a bookmark at the point separating the items that the borderline examinee is likely to answer correctly from those that he or she is not.

Methods involving ratings of examinees
- Contrasting groups: Raters identify one group of examinees who clearly meet the performance standard and another group whose members are clearly below that standard.
- Borderline group: Raters review lists of candidates and identify those that they view as borderline.

Methods involving ratings of examinee work
- Item by item (paper selection): Panelists review, item by item, samples of candidate responses to items in the test representing each test score and select the papers they view as representing the borderline examinee.
- Holistic (body of work, booklet classification): Panelists review, holistically, a sample of entire candidate test booklets and place them in different performance categories.

Methods involving ratings of score profiles
- Judgmental policy capturing: Panelists review score profiles across the exercises making up a performance assessment and assign each score profile to a proficiency level.
- Beuk: Panelists provide estimates of the expected percent correct for the borderline examinees and the expected passing rate for the examinee population.
- De Gruijter: Panelists provide estimates of the expected percent correct for the borderline examinees and the expected failing rate for the examinee population, plus an estimate of uncertainty for both values.
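As a worked illustration of the Angoff method from Table 2.1, the sketch below derives a panel cut score from panelist ratings: each panelist's item probabilities are summed to give that panelist's recommended raw cut score, and the panel's cut score is the mean of those sums. The ratings and the four-item test are hypothetical, invented purely for illustration.

```python
# Each row: one panelist's estimated probabilities that the borderline
# examinee answers each multiple-choice item correctly (hypothetical data).
ratings = [
    [0.6, 0.8, 0.4, 0.7],  # panelist 1
    [0.5, 0.9, 0.3, 0.7],  # panelist 2
    [0.7, 0.8, 0.5, 0.6],  # panelist 3
]

# A panelist's recommended raw cut score is the sum of his or her item ratings.
panelist_cuts = [sum(row) for row in ratings]

# The panel cut score is commonly taken as the mean of the recommendations.
cut_score = sum(panelist_cuts) / len(panelist_cuts)
print(round(cut_score, 2))  # mean of 2.5, 2.4, and 2.6 -> 2.5
```

In operational standard setting this computation is usually followed by discussion rounds, impact data, and possible revision of ratings, none of which the sketch attempts to model.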