This dissertation was completed at the University of Languages and International Studies, Vietnam National University, Hanoi.
This dissertation was defended on 10 May 2018.
This dissertation can be found at:
- National Library of Vietnam
- Library and Information Center, Vietnam National University, Hanoi
DECLARATION OF AUTHORSHIP
I hereby certify that the thesis I am submitting is entirely my own original work except where otherwise indicated. I am aware of the University's regulations concerning plagiarism, including those regulations concerning disciplinary actions that may result from plagiarism. Any use of the works of any other author, in any form, is properly acknowledged at their point of use.
Date of submission:
Ph.D. Candidate's Signature:
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Prof. Nguyễn Hòa
(Supervisor)
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Prof. Fred Davidson
(Co-supervisor)
2.2 Standard setting for an English proficiency test
2.3 Common elements in standard setting
2.3.1 Selecting a standard-setting method
2.3.3 Preparing descriptions of performance-level descriptors
2.3.6 Compiling ratings and obtaining cut scores
2.4.2 Internal evidence
2.4.3 External evidence
2.4.3.1 Comparisons to other standard-setting methods
2.4.3.2 Comparisons to other sources of information
2.4.3.3 Reasonableness of cut scores
3 Testing listening
3.1 Communicative language testing
3.3 Listening construct
4 Statistical analysis for a language test
4.1 Statistical analysis of multiple choice (MC) items
4.3 Investigating reliability of a language test
5 Review of validation studies
5.1 Review of validation studies on standard setting
5.2 Review of studies employing argument-based approach in validating language tests
CHAPTER III: METHODOLOGY
1 Context of the study
1.1 About the VSTEP.3-5 test
1.1.1 The development history of the VSTEP.3-5 test
1.1.2 The administration of the VSTEP.3-5 test in Vietnam
1.1.3 Test takers
1.1.4 Test structure and scoring rubrics
1.1.5 The establishment of the cut scores
1.2 About the VSTEP.3-5 listening test
1.2.1 Test purpose
1.2.3 Performance standards
1.2.4 The establishment of the cut scores of the VSTEP.3-5 listening test
2 Building an interpretive argument for the VSTEP.3-5 listening test
3 Methodology
3.1 Research questions
3.2.1 Analysis of the test tasks and test items
3.2.2 Analysis of test reliability
3.3 Description of Bookmark standard-setting procedures
3.4.1 Test takers of early 2017 administration
3.4.2 Participants for Bookmark standard-setting method
3.5.3.2 Item map
1.1.3 Relationship between the input and response
1.2.1 Overall statistics of item difficulty and item discrimination
1.2.2 Item analysis
2 Analysis of the test reliability
3 Analysis of the cut scores
3.1 Procedural evidence
3.2 Internal evidence
3.3 External evidence
CHAPTER V: FINDINGS AND DISCUSSIONS
1 The characteristics of the test tasks and test items
2 The reliability of the VSTEP.3-5 listening test
3 The accuracy of the cut scores of the VSTEP 3-5 listening test
CHAPTER VI: CONCLUSION
1 Overview of the thesis
2 Contributions of the study
3 Limitations of the study
4 Implications of the study
5 Suggestions for further research
LIST OF THESIS-RELATED PUBLICATIONS
REFERENCES
APPENDIX 1: Structure of the VSTEP.3-5 test
APPENDIX 2: Summary of the directness and interactiveness of the questions of the VSTEP.3-5 listening test
APPENDIX 3: Consent form (workshops)
APPENDIX 4: Agenda for Bookmark standard-setting procedure
APPENDIX 5: Panelist recording form
APPENDIX 6: Evaluation form for standard-setting participants
APPENDIX 7: Control file for WINSTEPS
APPENDIX 8: Timeline of the VSTEP.3-5 test administration
APPENDIX 9: List of the VSTEP.3-5 developers
LIST OF FIGURES
Figure 2.1: Model of Toulmin's argument structure (1958, 2003)
Figure 2.2: Sources of variance in test scores (Bachman, 1990)
Figure 2.3: Overview of interpretive argument for ESL writing course placements
Figure 4.1: Item map of the VSTEP.3-5 listening test
Figure 4.2: Graph for item 2
Figure 4.3: Graph for item 3
Figure 4.4: Graph for item 6
Figure 4.5: Graph for item 13
Figure 4.6: Graph for item 14
Figure 4.7: Graph for item 15
Figure 4.8: Graph for item 19
Figure 4.9: Graph for item 20
Figure 4.10: Graph for item 28
Figure 4.11: Graph for item 34
Figure 4.12: Total score for the scored items
LIST OF TABLES
Table 2.1: Review of standard-setting methods (Hambleton & Pitoniak, 2006)
Table 2.2: Standard-setting evaluation elements (Cizek & Bunch, 2007)
Table 2.3: Common steps required for standard setting (Cizek & Bunch, 2007)
Table 2.4: A framework for defining listening task characteristics (Buck, 2001)
Table 2.5: Criteria for item selection and interpretation of item difficulty index
Table 2.6: Criteria for item selection and interpretation of item discrimination index
Table 2.7: General guideline for interpreting test reliability (Bachman, 2004)
Table 2.8: Number of proficiency levels & test reliability
Table 2.9: Summary of the warrant and assumptions associated with each inference in the TOEFL interpretive argument (Chapelle et al., 2008)
Table 3.1: Structure of the VSTEP.3-5 test
Table 3.2: The cut scores of the VSTEP.3-5 test
Table 3.3: Performance standard of Overall Listening Comprehension (CEFR: learning, teaching, assessment)
Table 3.4: Performance standard of Understanding conversation between native speakers (CEFR: learning, teaching, assessment)
Table 3.5: Performance standard of Listening as a member of a live audience (CEFR: learning, teaching, assessment)
Table 3.6: Performance standard of Listening to announcements and instructions (CEFR: learning, teaching, assessment)
Table 3.7: Performance standard of Listening to audio media and recordings (CEFR: learning, teaching, assessment)
Table 3.8: The cut scores of the VSTEP.3-5 test
Table 3.9: Criteria for item selection and interpretation of item difficulty index
Table 3.10: Criteria for item selection and interpretation of item discrimination index
Table 3.11: Number of proficiency levels & test reliability
Table 3.12: The venue for Angoff and Bookmark standard-setting methods
Table 3.13: Comparison between the Flesch-Kincaid readability analysis and the CEFR and IELTS grading systems
Table 3.14: Summary of the interpretive argument for the interpretation and use of the VSTEP.3-5 listening test
Table 4.1: General instruction of the VSTEP.3-5 listening test
Table 4.4: Instruction for Part 3
Table 4.5: Information provided in the specifications for the VSTEP.3-5 listening test
Table 4.7: Description of language levels for texts of items 1-8 in the specification
Table 4.9: Description of language levels for texts of items 9-20 in the specification
Table 4.10: Summary of the texts for items 21-35
Table 4.11: Description of language levels for texts of items 21-35 in the specification
Table 4.12: Summary of item discrimination and item difficulty
Table 4.13: Summary statistics for the flagged items
Table 4.14: Information for item 2
Table 4.16: Option statistics for item 2
Table 4.19: Item statistics for item 3
Table 4.21: Quantile plot data for item 3
Table 4.24: Option statistics for item 6
Table 4.26: Information for item 13
Table 4.29: Quantile plot data for item 13
Information for item 14
Item statistics for item 14
Option statistics for item 14
Quantile plot data for item 14
Information for item 15
Item statistics for item 15
Option statistics for item 15
Quantile plot data for item 15
Item statistics for item 19
Option statistics for item 19
Quantile plot data for item 19
Item statistics for item 20
Quantile plot data for item 20
Information for item 28
Item statistics for item 28
Option statistics for item 28
Quantile plot data for item 28
Information for item 34
Option statistics for item 34
Quantile plot data for item 34
Information for item 19
Information for item 20
Option statistics for item 20
Item statistics for item 34
The person reliability and item reliability of the test
Number of proficiency levels and test reliability
The test reliability of the VSTEP.3-5 listening test
Round 3 feedback for Bookmark standard-setting procedure
Summary of output from Round 3 of Bookmark standard-setting procedure
The cut scores set for the VSTEP.3-5 listening test by Bookmark method
The cut scores set for the VSTEP.3-5 listening test by Angoff method
Comparison between the results of the two standard-setting methods
LIST OF KEY TERMS
Construct: A construct refers to the knowledge, skill or ability that is being tested. In a more technical and specific sense, it refers to a hypothesized ability or mental trait which cannot necessarily be directly observed or measured, for example, listening ability. Language tests attempt to measure the different constructs which underlie language ability.
Cut score: A score that represents achievement of the criterion: the line between success and failure, mastery and non-mastery.
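The way cut scores partition a score scale into performance categories can be sketched in a few lines of code. The level labels and cut points below are invented for the illustration; they are not the operational VSTEP.3-5 values.

```python
# Hypothetical illustration of cut scores partitioning a score scale into
# performance levels. The labels and cut points are invented for the example.

def classify(score, cut_scores, labels):
    """Assign a performance label given ascending cut scores.

    A score below the first cut score falls in the lowest category;
    reaching a cut score places the examinee in the next category up.
    """
    level = 0
    for cut in cut_scores:
        if score >= cut:
            level += 1
    return labels[level]

labels = ["Below B1", "B1", "B2", "C1"]   # hypothetical categories
cuts = [4.0, 6.0, 8.5]                    # hypothetical cut scores on a 0-10 scale

print(classify(3.5, cuts, labels))  # Below B1
print(classify(6.0, cuts, labels))  # B2 (a score equal to a cut score clears it)
```

With three cut scores the scale is divided into four regions, which is exactly the classification role cut scores play in a proficiency test.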
Descriptor: A brief description accompanying a band on a rating scale, which summarizes the degree of proficiency or type of performance expected for a test taker to achieve that particular score.
Distractor: The incorrect options in multiple-choice items.
Expert panel: A group of target language experts or subject matter experts who
provide comments about a test
High-stakes test: A high-stakes test is any test used to make important decisions about test takers.
Inference: A conclusion that is drawn about something based on evidence and reasoning.
Input: Input material provided in a test task for the test taker to use in order to produce an appropriate response.
Interpretive argument: Statements that specify the interpretation and use of the test performances in terms of the inferences and assumptions used to get from a person's test performance to the conclusions and decisions based on the test results.
Item (also, test item): Each testing point in a test which is given a separate score or scores. Examples are: one gap in a cloze test, one multiple-choice question with three or four options, one sentence for grammatical transformation, one question to which a sentence-length response is expected.
Key: The correct option or response to a test item.
Multiple-choice item: A type of test item which consists of a question or incomplete sentence (stem), with a choice of answers or ways of completing the sentence (options). The test taker's task is to choose the correct option (key) from a set of possibilities. There may be any number of incorrect possibilities (distractors).
Options: The range of possibilities in a multiple-choice item or matching task from which the correct one (key) must be selected.
Panelist: A target language expert or subject matter expert who provides judgments informing the application of cut scores to examinee performance on a test.
Performance standard: The abstract conceptualization of the minimum level of performance distinguishing examinees who possess an acceptable level of knowledge, skill, or ability judged necessary to be assigned to a category, or for some other specific purpose, and those who do not possess that level. This term is sometimes used interchangeably with cut score.
Proficiency test: A test which measures how much of a language someone has learned. Proficiency tests are designed to measure the language ability of examinees regardless of how, when, why or under what circumstances they may have experienced the language.
Readability: Readability is the ease with which a reader can understand a written text. The readability of text depends on its content (the complexity of its vocabulary and syntax) and its presentation (such as typographic aspects like font size, line height, and line length).
Reliability: The reliability of a test is concerned with the consistency of scoring and the accuracy of the administration procedures of the test.
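Reliability in the internal-consistency sense is commonly estimated with Cronbach's alpha, which for dichotomously scored items reduces to KR-20. A minimal sketch, using an invented 0/1 response matrix (rows are test takers, columns are items):

```python
# Sketch of one common reliability estimate, Cronbach's alpha (equivalent to
# KR-20 for dichotomously scored items). The response matrix is invented.

def cronbach_alpha(responses):
    k = len(responses[0])        # number of items
    n = len(responses)           # number of test takers (unused beyond the data)

    def variance(xs):            # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(data), 3))
```

In practice such estimates come from dedicated tools (e.g. Rasch person/item reliability in WINSTEPS, as used later in this thesis); the formula above only illustrates the internal-consistency idea.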
Response probability (RP) criterion: In the context of Bookmark and similar item-mapping standard-setting procedures, the criterion used to operationalize participants' judgment regarding the probability of a correct response (for dichotomously scored items) or the probability of achieving a given score point or higher (for polytomously scored items). In practical applications, two RP criteria appear to be used most frequently (RP50 and RP67); other RP criteria have also been used, though considerably less frequently.
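Under the dichotomous Rasch model often used with the Bookmark method, the RP criterion maps each item difficulty to an ability location: solving P(correct) = RP gives θ = b + ln(RP / (1 − RP)). A sketch with hypothetical item difficulties (in logits), not values from the VSTEP.3-5 analysis:

```python
import math

# Sketch of how an RP criterion maps item difficulties to ability locations
# under the dichotomous Rasch model, P(correct) = 1 / (1 + exp(-(theta - b))).
# Solving P = RP for theta gives: theta = b + ln(RP / (1 - RP)).
# The item difficulties below are invented for illustration.

def rp_location(b, rp):
    """Ability at which an item of difficulty b is answered correctly
    with probability rp."""
    return b + math.log(rp / (1 - rp))

difficulties = [-1.2, -0.3, 0.4, 1.1]   # hypothetical Rasch difficulties (logits)

# With RP50 the mapped location equals the difficulty itself;
# RP67 shifts every item upward by about 0.7 logits.
for b in difficulties:
    print(b, round(rp_location(b, 0.50), 2), round(rp_location(b, 0.67), 2))
```

Because the shift is the same for every item, the choice of RP criterion does not reorder the items in the ordered item booklet, but it does move the ability locations, and hence the resulting cut scores.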
Rubric: A set of instructions or guidelines on an exam paper
Selected-response: An item format in which the test taker must choose the correct answer from alternatives provided.
Specifications (also, test specifications): A description of the characteristics of a test, including what is tested, how it is tested, and details such as number and length of forms, and item types used.
Standard setting: A measurement activity in which a procedure is applied to systematically gather and analyze human judgment for the purpose of deriving one or more cut scores for a test.
Standardized test: A standardized test is any form of test that (1) requires all test takers to answer the same questions, or a selection of questions from a common bank of questions, in the same way, and that (2) is scored in a "standard" or consistent manner, which makes it possible to compare the relative performance of individual students or groups of students.
Test form: Test forms refer to different versions of tests that are designed in the same format and used for different administrations.
Validation: An action of checking or proving the validity or accuracy of something. The validity of a test can only be established through a process of validation.
Validity: The degree to which a test measures what it is supposed to measure, or can be used successfully for the purpose for which it is intended. A number of different statistical procedures can be applied to a test to estimate its validity. Such procedures generally seek to determine what the test measures, and how well it does so.
ABSTRACT
Standard setting is an important phase in the development of an examination program, especially for a high-stakes test. Standard-setting studies are designed to identify reasonable cut scores and to provide backing for this choice of cut scores.
This study was aimed at investigating the validity of the cut scores established for a VSTEP.3-5 listening test administered in early 2017 to 1,562 test takers by one institution permitted by the Ministry of Education and Training, Vietnam to design and administer the VSTEP.3-5 tests. The study adopted the current argument-based validation approach with a focus on three main inferences constructing the validity argument: (1) test tasks and items, (2) test reliability, and (3) cut scores. The argument is that in order for the cut scores of the VSTEP.3-5 listening test to be valid, the test tasks and test items first needed to be designed in accordance with the characteristics specified in the specifications. Second, the listening test scores should be sufficiently reliable so as to reasonably reflect test takers' listening proficiency. Third, the cut scores should be reasonably established for the VSTEP.3-5 listening test.
In this study, both qualitative and quantitative methods were combined and structured to provide backing for and against the assumptions in each of these three inferences. With regard to the first and second inferences, an analysis of the test tasks and the test items was conducted, and test reliability was investigated in order to see whether it was in the acceptable range. In terms of the third inference, about the cut scores of the VSTEP.3-5 listening test, the Bookmark standard-setting method was implemented and the results were compared with those currently applied for the test.
This study offers contributions in three areas. First, this study supports the widely held notion that validity is a unitary concept and that validation is the process of building an interpretive argument and collecting evidence in support of that argument. Second, this study contributes towards raising awareness of the importance of evaluating the cut scores of high-stakes language tests in Vietnam so that fairness can be ensured for all test takers. Third, this study contributes to the construction of a systematic, transparent and defensible body of validity argument for the VSTEP.3-5 test in general and its listening component in particular. The results of this study are helpful in providing informative feedback on the establishment of the cut scores for the VSTEP.3-5 listening test, the test specifications, and the test development process. The positive results can provide evidence to strengthen the reasonableness of the cut scores, the specifications and the quality of the VSTEP.3-5 listening test. The negative results can give suggestions for changes or improvement in the cut scores, the specifications and the design of the VSTEP.3-5 listening test.
appreciate all his contributions of time, ideas and other assistance to make my Ph.D. experience productive and stimulating. His enthusiasm and encouragement were motivational for me, making my Ph.D. pursuit a short and enjoyable journey. I am also very grateful to him for involving me in his various research projects, which has provided me with a lot of experience in conducting this study. He has been a tremendous mentor.
I would also like to thank my co-supervisor, Fred Davidson, Professor Emeritus from the University of Illinois, for giving me the very first ideas, advice and guidance on how to start my Ph.D. study. His advice on both research and my career has been invaluable.
I am especially thankful to Professor Nathan T. Carr from California State University, Fullerton for conducting a series of workshops on designing and analyzing language tests at the University of Languages and International Studies, Vietnam National University - Hanoi. Being able to discuss my work with him has been invaluable for developing my ideas. Sharing his knowledge and experience about language testing and assessment in general and standard-setting methods in particular has been a great contribution to the completion of my Ph.D. thesis.
I want to thank all of my colleagues at the University of Languages and International Studies, Vietnam National University - Hanoi, especially my colleagues at the Center for Language Testing and Assessment, for sharing my workload and always cheering me up when I was down.
My sincere thanks also go to Dr. Huynh Anh Tuan, Dean of the Faculty of Post-graduate Studies, and his staff for helping me to process the paperwork and constantly reminding me of the deadlines. Without their support and encouragement, I would have postponed my thesis defense for one or two more years.
Words cannot express how grateful I am to my family. I want to say thank you to my parents and siblings for their encouragement during the time I conducted my study.
This thesis is dedicated to my beloved husband and my daughter for their love, endless support, encouragement and sacrifices throughout this experience.
As a final word, I would like to thank each and every individual who has been a source of support and encouragement and helped me to achieve my goal and complete my thesis work successfully.
CHAPTER I
INTRODUCTION
This chapter introduces the topic of the study and presents the main reasons for choosing it. After that, the chapter presents the questions that are going to be addressed within the scope of the study. A brief overview of the organization of the thesis will close the chapter.
1 Statement of the problem
The term "cut scores" refers to the lowest possible scores on a standardized test, high-stakes test or other forms of assessment that help to separate a test score scale into two or more regions, creating categories of performance or classification of examinees. Clearly, if the cut scores are not appropriately set, the results of the assessment could come into question. For this reason, establishing cut scores for a test has been considered an important and practical aspect of standard setting. In Kane's (2006) recent discussion of test validation, besides emphasizing the importance of carefully defining the selected cut scores, he highlights the evaluation of the reasonableness of the cut scores and states that the establishment of the cut scores is a complex endeavor, but the validation of the cut scores is even more difficult.
According to the Standards for Educational and Psychological Testing (AERA et al., 1999, p. 9), validity is defined as "the degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests", and test validation is the process of making a case for the proposed interpretation and use of test scores. This case takes the form of an argument that states a series of propositions supporting the proposed interpretation and use of test scores and summarizes the evidence supporting these propositions (Kane, 2006). With regard to standard setting, since there are no "gold standards" and "true cut scores", to validate established cut scores means to provide evidence in support of the plausibility and appropriateness of the proposed cut score interpretations, their credibility and defensibility (Kane et al., 1999). Around the world, though plenty of studies have been conducted on the validity of cut scores established for a test, these studies mainly aim at cross-validating two different methods of standard setting and comparing the results of these methods instead of investigating the validity of cut scores as a whole.
In Vietnam, the National Foreign Language 2020 Project (NFL2020) was initiated in 2008 with the aim to "renovate the teaching and learning of foreign languages within the national education system" so that "... by 2020, most Vietnamese students graduating from secondary, vocational schools, colleges and universities will be able to use a foreign language confidently in their daily communication, their study and work in an integrated, multi-cultural and multi-lingual environment, making foreign languages a comparative advantage of development for Vietnamese people in the cause of industrialization and modernization for the country" (Decision 1400/QD-TTg). Language assessment is considered a major component of this project. The biggest achievement of this component is the emergence of the first ever standardized test of English proficiency (the VSTEP.3-5 test). The test was officially released by the Ministry of Education and Training, Vietnam on 11 March 2015. The test aims at measuring English ability across a broad language proficiency continuum from level 3 to level 5, which is equivalent to B1-C1 CEFR levels (Common European Framework of Reference for Languages). The cut scores of the VSTEP.3-5 test help to categorize test takers and certify them based on the levels they achieve. These cut scores are applied to all of the results of the VSTEP.3-5 tests, which are supposed to be strictly built in accordance with the test specifications.
At the moment, the results and certificates of the VSTEP.3-5 test are used by many companies as the requirement for a job position and by many educational institutions as a "visa" for learners to be accepted into or graduate from an academic program. For example, English teachers from primary schools and secondary schools throughout Vietnam are expected to obtain level 4 in English (equivalent to B2) while the requirement for those working in high schools, colleges and universities is level 5 (equivalent to C1). Besides, in order to graduate from universities, English-major students need to show evidence of their English at level 5 (equivalent to C1) and that for non-English-major students is level 3 (equivalent to B1). This shows that the uses of the VSTEP.3-5 test and the decisions that are made from the test cut scores have important consequences for the stakeholders. Like other high-stakes tests such as TOEFL, IELTS, PTE, or Cambridge tests, in order to gain credibility and defensibility, more research needs to be conducted on the test in general and the validity of the VSTEP.3-5 cut scores in particular. However, so far, there have been few studies on the VSTEP.3-5 test and there is no validation research on the cut scores of the test.
Among the skills tested in high-stakes examinations, listening is the skill that the fewest researchers choose to study. According to Buck (2001), the assessment of listening ability is one of the least understood and least developed areas of language testing and assessment. However, Buck (2001) also states that the assessment of listening ability is one of the most important testing aspects. In terms of standard setting and cut score validation, the procedure for listening tests is also much more complicated and time-consuming. However, for the author of this study, listening is a skill that is really interesting and thus worth exploring.
All of the reasons mentioned above have intrigued the author of this doctoral thesis to conduct a validation study on the cut scores of the VSTEP.3-5 listening test by using the validity argument-based model proposed by Kane (2013). A validity argument is a set of related propositions that, taken together, form an argument in support of an intended use or interpretation of the test scores. With the deeply rooted desire to develop a good proficiency listening test in Vietnam, this research is expected to bring the author of this doctoral thesis a profound insight into this specific area of interest for her future professional development.
2 Objectives of the study
As mentioned, since the VSTEP.3-5 test is a newly developed high-stakes test, the need to standardize it is imperative. Thus, this doctoral research is conducted as an ongoing attempt at building a systematic, transparent and defensible body of validity argument for the VSTEP.3-5 test in general and its listening component in particular. By adopting the argument-based approach recommended by Kane (2013), the study aims at investigating the validity of the cut scores of the VSTEP.3-5 listening test.
3 Significance of the study
This study will be a significant endeavor in building a systematic, transparent and defensible body of validity argumentation for the VSTEP.3-5 test in general and its listening component in particular. This study will also contribute to the practice of validating the cut scores of a test by adopting the argument-based approach. Moreover, the results of this study will be helpful in providing a close look at the test specifications of the VSTEP.3-5 listening test, the test development process, and the establishment of the cut scores of the test. These results can provide evidence to either support the reasonableness of the test specifications of the VSTEP.3-5 listening test, the test development process and the establishment of the cut scores of the test, or suggest adjustments to them.
4 Scope of the study
In the current context of English testing and assessment in Vietnam, the cut scores of the VSTEP.3-5 listening test are pre-established and applied to all of the test forms, which are supposed to be strictly designed in accordance with the specifications. Thus, when a VSTEP.3-5 listening test is delivered by any authorized institution, it is supposed to have been constructed based on the specifications so that the cut scores can be interpreted in the preset way. Within the scope of this study, the focus is on the validation of the cut scores of the VSTEP.3-5 listening test administered in early 2017 by one institution permitted by the Ministry of Education and Training, Vietnam (MOET) to design and administer the VSTEP.3-5 test (hereinafter referred to as the VSTEP.3-5 listening test). The argument is that in order for the cut scores of the VSTEP.3-5 listening test to be valid, (1) the test tasks and test items are designed in accordance with the test specifications; (2) the test scores are reliable in measuring test takers' proficiency; and (3) the cut scores are reasonably established so that they are useful for making decisions about test takers' English listening competency. Thus, the three aspects of the test that will be taken into consideration in this study include: (1) the design of test tasks and test items, (2) the test reliability, and (3) the accuracy of the cut scores.
5 Statement of research questions
Based on the interpretive argument for the validity of the cut scores of the VSTEP.3-5 listening test, there is one main research question for the study, which is then clarified by three sub-questions.
The main research question is:
To what extent do the cut scores of the VSTEP.3-5 listening test provide reasonable interpretation of the test takers' listening ability?
The three sub-questions that help to clarify the main research question are:
1. To what extent are the test tasks and the test items of the VSTEP.3-5 listening test properly designed in accordance with the specifications?
2. To what extent are the VSTEP.3-5 listening test scores reliable in measuring the test takers' English proficiency?
3. To what extent are the cut scores reasonably established for the VSTEP.3-5 listening test?
6 Organization of the study
The study consists of six chapters as follows:
Chapter I: Introduction
Chapter II: Literature Review
Chapter III: Methodology
Chapter IV: Data Analysis
Chapter V: Findings and Discussions
Chapter VI: Conclusion
Chapter I is aimed at introducing the topic of the study and presenting the main reasons for the author to implement this project.
Chapter II is to provide a profound theoretical and empirical background with a critical discussion on the relevant concepts, models, or theories for the study.
Chapter III presents the context of the study and how the study is conducted, together with a review of each selected method.
Chapter IV presents the data analysis of the study.
Chapter V presents the findings of the study and discusses these results.
Chapter VI has two aims. First, it specifies the limitations of the study. Second, it suggests some directions for future studies.
CHAPTER II
LITERATURE REVIEW
This chapter reviews theories and research which are fundamental to the current study. The first part of this chapter starts with a presentation of how the concept of validity has changed over the years. Then, it discusses the validation approaches and procedures in articulating a validation argument before describing different kinds of evidence that can be collected in support of a validation argument. The second part of this chapter focuses on standard setting, including definitions of the concept, an overview of different standard-setting methods, a discussion of common elements in standard setting, and issues related to standard-setting validation. The third part of this chapter first addresses the issues about testing listening and then presents the framework for analyzing the listening test tasks. The fourth part of this chapter describes the important statistical analyses for a language test, including item difficulty, item discrimination and test reliability. Finally, the review of related validation studies ends the chapter.
1 Validation in language testing
1.1 The evolution of the concept of validity
Validity is considered one of the most important concepts in psychometrics, but as Sireci (2009) states, validity has taken on many different meanings over the years. In the early 20th century, validity was primarily defined in terms of the correlation of test scores with some other criterion, and tests were described as valid for anything they correlated with (Kelley, 1927; Thurstone, 1932; Bingham, 1937). Another theoretical definition of validity was proposed by Garrett (1937), who defined validity simply as the degree to which "a test measures what it is supposed to measure" (Garrett, 1937, p. 324). However, this definition has long been criticized: it states a necessary requirement, but one that is insufficient by itself to validate a test.
By the middle of the 20th century, Rulon (1946) was one of many psychometricians (Ebel, 1956; Gulliksen, 1950; Mosier, 1947; Cureton, 1951) calling for a more comprehensive treatment of validity theory and test validation. In 1954, the Technical Recommendations (APA, 1954) proposed four types of validity, namely predictive validity, concurrent validity, content validity, and construct validity. In contrast, Anastasi (1954) organized the presentation of validity in terms of face validity, content validity, factorial validity, and empirical validity before adopting the framework promulgated in the 1954 Standards. The framework was later presented in terms of content validity, criterion-related validity, and construct validity (Anastasi, 1982; Cronbach, 1955, 1960). In this framework, predictive validity and concurrent validity were seen as criterion-oriented validity, more space was allocated to construct validity, and content validity was mentioned in an additional extended discussion of proficiency tests and construct validity.
In the Standards for Educational and Psychological Testing (AERA et al., 1985), validity is defined as follows: "The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. A variety of inferences may be made from scores produced by a given test, and there are many ways of accumulating evidence to support any particular inference. Validity, however, is a unitary concept" (APA, 1985, p. 9). It was clearly stated that "the inferences regarding specific uses of a test are validated, not the test itself" (APA, 1985, p. 9).
The definition in the 1985 Standards was well explained and elaborated by Bachman (1990). According to Bachman (1990), "the process of validation starts with the inferences that are drawn and the uses that are made of scores. These uses and inferences dictate the kinds of evidence and logical argument that are required to support judgments regarding validity. Judging the extent to which an interpretation or use of a given test score is valid thus requires the collection of evidence supporting the relationship between the test score and an interpretation or use" (Bachman, 1990, p. 243). Messick (1989) defines validity as a unitary concept and describes validity "as an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores" (p. 13). This theoretical conceptualization of validity as a unitary concept has been widely endorsed by the measurement profession as a whole (Shepard, 1993; AERA et al., 1999; Messick, 1989; Kane, 1992, 2006, 2009, 2013), and it has a strong influence on current practice in educational and psychological testing in many parts of the world. In this study, this conceptualization of validity is adopted as the theoretical foundation for understanding different aspects of validity.
1.2 Aspects of validity
There are several aspects of this conceptualization of validity that need to be borne in mind. First listed by Linn and Gronlund (2000) and then elaborated by Kane (2009, 2013), five aspects of validity can be summarized as follows. First, it is the interpretations and uses of test scores that are validated, not the tests themselves. It would be more correct to speak of the validity of the uses we make of test scores, or to speak of test scores as valid indicators or measures of a particular ability. However, it can be quite reasonable to talk about the validity of a test if an interpretation or use has already been adopted, explicitly or implicitly.
Second, validity is a matter of degree, and it may change over time as the interpretation/use develops and as new evidence accumulates. Since a particular test score can never provide a perfectly accurate measure of a given ability, and the validity of an interpretation and use always depends on the logic of the interpretive argument and the strength of the evidence provided in support of this argument, we can never prove that our interpretation and use are valid. Thus, it is better to provide evidence that the intended interpretation and use are more plausible than other interpretations that might be offered.
Third, validity is always specific to a particular use or interpretation. When a test is developed, we always have a particular set of interpretations and uses in mind. These intended interpretations will depend on how the construct or ability to be measured is defined. The way we define a given ability may differ from the way it might be defined for different purposes or for different groups of test-takers. Thus, the scores of a particular test will not necessarily be appropriate for other situations or other purposes.
Fourth, validity is a unitary concept. We often hear about different types of validity, such as content validity, concurrent validity, predictive validity, or construct validity. However, as defined above, validity is a single quality of the ways in which we use a particular test. Many different kinds of evidence, such as test content analysis or correlations with other measures of the ability, can be provided in support of the intended interpretation and use. For this reason, the evidence needed for validation depends on the interpretation and use, and different interpretations or uses will require different kinds and different amounts of evidence for their validation.
Fifth, validity involves an overall evaluative judgment. A validation argument typically includes several parts and is supported by different kinds of evidence, none of which by itself is sufficient to justify the intended inferences and uses of a particular test. Thus, when evaluating the validity of inferences and uses, it is important to consider the interpretive argument in its entirety, as well as all the supporting evidence.
In summary, given the aspects of validity described above, investigating the validity of test use, which is called validation, can be seen as the process of building an interpretive argument and collecting evidence in support of that argument (Kane, 1992, 2006, 2013). This validation approach was termed the argument-based approach to validation by Kane (1992) and is widely applied in validation practice today.
1.3 Argument-based approach to validation
Kane (1992, 2006, 2013) recommends that researchers follow an argument-based approach to make the task of validating inferences from test scores both scientifically sound and manageable. In this approach, the validator builds an argument that focuses on defending the interpretation and use of test scores for a particular purpose and is based on empirical evidence to support that particular use. According to Kane (2006), validation consists of two types of arguments: an interpretive argument and a validity argument. The interpretive argument is built upon a number of inferences and assumptions that are meant to justify score interpretation and use, whereas the validity argument provides an evaluation of the interpretive argument in terms of how reasonable and coherent it is, as well as how plausible its assumptions are (Cronbach, 1988). To be more specific, the establishment of an interpretive argument involves (1) the determination of the inferences based on test scores, (2) the articulation of assumptions, (3) the decision on the sources of evidence that can support or refute those inferences, (4) the collection of appropriate data, and (5) the analysis of the evidence. In order for the interpretation or use of test scores to be valid, all of the inferences and assumptions inherent in the interpretation or use have to be plausible.
Kane (1992, 2006, 2013) cites Toulmin's (1958, 2003) framework as a guide for applying his approach to validation. Toulmin's (1958) framework essentially requires that a chain of reasoning be established that builds a case towards a conclusion, which in this case is to determine the plausibility and reasonableness of score interpretation and use. Figure 2.1 shows Toulmin's (1958, 2003) argument structure, which is built on several components, including the grounds, claim, warrant, backing, and rebuttal.
[Figure 2.1: Model of Toulmin's argument structure (1958, 2003), showing grounds, claim, warrants, backing, and rebuttal]
In terms of test score interpretation and use, the claim of an argument is the conclusion drawn about a test-taker based on test performance, whereas the grounds serve as the data or observations upon which the claim is based. An example can be illustrated by the case given by Mislevy et al. (2003). One may make the claim that a student's English speaking abilities are inadequate for studying at an English-medium university based on the grounds/observation that the student did not perform well in the final oral examination in an intensive English class. The performance that the teacher observed when the student spoke in front of the class on the assigned topic was characterized by hesitations and mispronunciations.
In the interpretive argument underlying the claim about the student's readiness for university study, the inference linking the grounds to the claim is not given, and therefore justification is needed in the form of a warrant (or assumption). Figure 2.1 shows the interpretive argument as consisting of one inference, which is authorized by a warrant. The warrant in Toulmin's model is regarded as a rule, principle, or established procedure that is meant to provide justification for the inference connecting the grounds to the claim. In the example provided above, the warrant is the generally held principle that hesitations and mispronunciations are characteristics of students with low levels of English speaking ability, who would have difficulty at an English-medium university.
The warrants in turn need backing (or evidence), which comes in the form of theories, research, data, and experience. In relation to the example provided above, backing might be drawn from the teacher's training and previous experience with non-native speakers at an English-medium university. Stronger backing could be obtained by having the student's speaking performance rated by another teacher and then showing the agreement between the two raters.
Finally, a rebuttal also acts as a link between the grounds and the claim, but serves to weaken the initial argument by providing evidence or a possible explanation that may call the warrant into question. Going back to the previous example, a possible rebuttal may be that the assigned topic for the oral presentation required the student to use highly technical and unfamiliar vocabulary. This rebuttal would serve to weaken the inference connecting the grounds (that the oral presentation contained many hesitations and mispronunciations) and the claim that the student's speaking ability was at a level that would not allow him to succeed at an English-medium university.
To validate the proposed interpretation and use of test scores, it is necessary to evaluate the clarity, coherence, and completeness of the interpretive argument and to evaluate the plausibility of each of the inferences and assumptions in the interpretive argument. The proposed interpretation and use of the test scores should be explicitly stated in terms of a network of inferences and assumptions, and then the plausibility of the warrants for these inferences can be evaluated using relevant evidence. The validation can be simple if the interpretation and use are simple and limited, involving a few plausible inferences. In contrast, if the interpretation and use are ambitious, involving a more extensive network of inferences, the validation will require more extensive evidence.
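As an informal illustration only (this sketch is not part of Toulmin's or Kane's own formalism), the components of a Toulmin-style argument described above can be represented as a small data structure. All names, the example strings, and the crude `is_supported` heuristic are the present author's hypothetical choices:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ToulminArgument:
    """Minimal sketch of Toulmin's (1958) argument components."""
    grounds: str   # data/observations the claim rests on
    claim: str     # conclusion drawn about the test-taker
    warrant: str   # rule licensing the grounds-to-claim inference
    backing: List[str] = field(default_factory=list)    # evidence for the warrant
    rebuttals: List[str] = field(default_factory=list)  # conditions weakening it

    def is_supported(self) -> bool:
        # Crude heuristic: the inference is tentatively supported when some
        # backing exists and no rebuttal has been raised. A real validity
        # argument weighs evidence rather than counting it.
        return bool(self.backing) and not self.rebuttals

argument = ToulminArgument(
    grounds="Hesitations and mispronunciations in the final oral exam",
    claim="Speaking ability is inadequate for an English-medium university",
    warrant="Hesitations and mispronunciations indicate low speaking ability",
    backing=["Teacher's rating", "Agreement with a second rater"],
)
print(argument.is_supported())  # True: backing present, no rebuttal yet
argument.rebuttals.append("Topic required unfamiliar technical vocabulary")
print(argument.is_supported())  # False once a rebuttal is raised
```

The point of the sketch is only that a rebuttal weakens an otherwise backed inference, mirroring the oral-presentation example above.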
There are some advantages of this approach, summarized by Kane (1990, 2006). First, the argument-based approach can be applied to any type of test interpretation or use. It does not discourage the development of any kind of interpretation. It does not preclude the use of any kind of data collection technique in developing a measurement procedure. It does not identify any kind of validity evidence as being generally preferable to any other kind. It suggests that the interpretation be stated as clearly as possible and that the validity evidence be consistent with the interpretation.
Second, the argument-based approach to validation provides definite guidelines for systematically evaluating the validity of proposed interpretations and uses of test scores. In other words, this approach provides the researcher with a clear place to begin and a direction to follow, and as a result it helps the researcher to focus serious attention on validation.
Third, in the argument-based approach, the evaluation of the interpretive argument does not lead to any absolute decision about validity, but it does provide a way to measure progress (Kane, 2006). In other words, this approach views validation as an ongoing and critical process rather than a matter of simply answering "valid" or "invalid". Because the most problematic inferences and their supporting assumptions are checked and are either supported by the evidence or at least made less problematic, the reasonableness of the interpretive argument as a whole can improve.
Fourth, the argument-based approach to validation may increase the chance that research on validity will lead to improvements in measurement procedures. Since the argument-based approach focuses attention on specific parts of the interpretive argument and on specific aspects of measurement procedures, evidence which indicates the existence of a problem, such as inadequate coverage of content or the presence of systematic errors, helps to suggest ways to solve the problem and thereby to improve the procedure.
As stated above, applying the argument-based approach to validation allows a researcher to work on any type of interpretation or use of test scores, as long as the interpretation is stated as clearly as possible and the validity evidence is consistent with the interpretation. With regard to the interpretation and use of the cut scores of a test, before stating the inferences and assumptions that help to structure the interpretive argument, it is necessary to consider the important aspects of standard setting and the elements needed for applying pre-established cut scores to a listening proficiency test. The following part presents the issues related to standard setting and the framework for analysing a listening test.
2 Standard setting for an English proficiency test
2.1 Definition of standard setting
Probably the most difficult and controversial part of testing is standard setting. In this study, standard setting refers to the establishment of cut scores for a test, i.e., determining the points on the score scale that separate examinees into performance categories such as pass/fail. However, in order to have a complete understanding of the concept, it is important to become familiar with the theoretical foundations of the term.
Cizek (1993) suggests an elaborate and theoretically grounded definition of standard setting. He defines standard setting as "the proper following of a prescribed, rational system of rules or procedures resulting in the assignment of a number to differentiate between two or more states or degrees of performance" (Cizek, 1993, p. 10). This definition highlights the procedural aspect of standard setting and draws on the legal framework of due process and traditional definitions of measurement. However, the definition suggested by Cizek (1993) suffers from at least one deficiency in that it addresses only one aspect of the legal principle known as due process, that is, a process that is clearly articulated in advance, is applied uniformly, and includes an avenue for appeal.
In contrast to the procedural aspect of due process is the substantive aspect, which centers on the results of the procedure. Substantive due process demands that the procedure lead to a decision or result that is fundamentally fair. However, the notion of fairness is, to some extent, subjective. The aspect of fundamental fairness is related to what has been called the "consequential basis of test use" in Messick's (1989, p. 84) explication of the various sources of evidence in support of the use and interpretation of test scores.
Kane (1994) provides another definition of standard setting that highlights the conceptual nature of the endeavor. He states that "it is useful to draw a distinction between the passing score, defined as a point on the score scale, and the performance standard, defined as the minimally adequate level of performance for some purpose. The performance standard is the conceptual version of the desired level of competence, and the passing score is the operational version" (Kane, 1994, p. 426). For this reason, as stated, this study uses the terms "standard setting" and "cut scores" interchangeably.
Another key concept underlying Kane's definition of standard setting is the concept of inference. As discussed in the previous part, an inference is the interpretation, conclusion, or meaning that one intends to draw about an examinee's underlying, unobserved level of knowledge, skill, or ability. From this perspective, validity refers to the accuracy of the inferences made about the examinee, usually based on observations of the examinee's performance. The assumption underlying the inference of standard setting is that the passing score creates meaningful categories that distinguish between individuals who meet some performance standard and those who do not.
Thus, for this study, the primacy of test purpose and the intended inference or test score interpretation is essential to understanding the definition of standard setting. To validate the standard setting, or the establishment of the cut scores, is to evaluate the accuracy of the inferences that are made when examinees are classified based on the application of the cut scores.
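The classification step described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the cut scores and the CEFR-style level labels below are invented for illustration and are not the operational VSTEP.3-5 values.

```python
# Hypothetical cut scores on a 0-10 listening score scale, ordered from the
# highest cut downward. Real cut scores come from a standard-setting study.
CUT_SCORES = [(8.5, "C1"), (6.0, "B2"), (4.0, "B1")]

def classify(score: float) -> str:
    """Map a raw score to a performance level using pre-established cut scores."""
    for cut, level in CUT_SCORES:
        if score >= cut:
            return level
    return "Below B1"

print(classify(7.25))  # -> B2
```

Validating the standard setting then amounts to asking how accurately such classifications reflect examinees' actual ability levels.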
Finally, in wrapping up the definition of standard setting, it is important to note what standard setting is not. According to Cizek and Bunch (2007, p. 18), standard setting does not seek to find "true" cut scores that separate real, unique categories on a continuum of competence. Since there is no external truth for all the things we care about, and no set of minimum competences that are necessary and sufficient for life success, all standard setting is judgmental. Cizek and Bunch (2007) also state that, because standard setting necessarily involves human opinions and values, it can to some degree be seen as a combination of technical, psychometric methods and policy making. Seen in this way, standard setting can be regarded as a procedure that enables participants to bring their judgments, by using a specified method, in such a way as to translate the policy positions of authorizing entities into locations on a score scale.
2.2 Overview of standard-setting methods
In recent years, more attention has been paid to establishing the credibility of existing standard-setting methods and to investigating new methods. Many of the new approaches have been developed to present judges with meaningful activities and to better accommodate the changing nature of assessments. Cizek and Bunch (2007) summarize 18 different methods of cut-score establishment. However, one common element of all these methods is that they involve, to one degree or another, human beings expressing informed judgments based on the best evidence available to them, and these judgments are summarized in some systematic way, typically with the aid of a mathematical model, to produce one or more cut scores. Cizek and Bunch (2007) state that each standard-setting method combines art and science, and different methods may yield different results. For this reason, it is hard to say that the cut scores established by one particular method are always superior to those set by another method. However, it can be stated that some methods are better suited to certain types of tests or circumstances.
Hambleton and Pitoniak (2006) classify standard-setting methods into four categories: (1) methods that involve review of test items and scoring rubrics, (2) methods that involve review of examinees, (3) methods that involve looking at examinee work, and (4) methods that involve panelist review of score profiles.
Table 2.1 summarizes standard-setting methods, organized by type of rating.

Table 2.1: Standard-setting methods, organized by type of rating

Methods involving ratings of test items
- Angoff: Panelists estimate the probability that the borderline examinee will answer each multiple-choice item correctly.
- Extended Angoff and related methods: Similar to the Angoff method for multiple-choice items, except that the panelists rate polytomous items (or, with the analytic method, items within a simulation) and indicate the number of score points (or the average score of 100 borderline examinees) that the borderline examinee will obtain.
- Ebel: Panelists rate each item along two dimensions, difficulty and relevance. For each combination of the dimensions, panelists estimate the percentage of items that the borderline examinee will answer correctly.
- Nedelsky: Panelists indicate which distractors they think the borderline examinee would be able to rule out as incorrect; the reciprocal of the number of distractors not ruled out, plus one (for the correct answer), is taken as the probability that the borderline examinee will answer the item correctly by random selection among the remaining answer choices.
- Yes/No: Panelists make a yes/no rating for each item in terms of whether every examinee (not just the borderline examinee) should be able to answer the item correctly.
- Bookmark: Panelists review test items that are ordered according to their difficulty parameter and place a bookmark at the point separating the items that the borderline examinee is likely to answer correctly from those that he or she is not.

Methods involving ratings of examinees
- Contrasting groups: Raters identify one group of examinees who clearly meet the performance standard and another group whose members are clearly below that standard.
- Borderline group: Raters review lists of candidates and identify those that they view as borderline.

Methods involving ratings of examinee work
- Item by item (paper selection): Panelists review, item by item, samples of candidate responses to items in the test representing each test score and select the papers they view as representing the borderline examinee.
- Holistic (body of work, booklet classification): Panelists review, holistically, a sample of entire candidate test booklets and place them in different performance categories.

Methods involving ratings of score profiles
- Judgmental policy capturing: Panelists review score profiles across the exercises making up a performance assessment and assign each score profile to a proficiency level.
- Beuk: Panelists provide estimates of the expected percent correct for the borderline examinees and the expected passing rate for the examinee population.
- De Gruijter: Panelists provide estimates of the expected percent correct for the borderline examinees and the expected failing rate for the examinee population, plus an estimate of uncertainty for both values.
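As a worked illustration of the Angoff method from Table 2.1, the sketch below derives a panel cut score from panelist ratings: each panelist's item probabilities are summed to give that panelist's recommended raw cut score, and the panel's cut score is the mean of those sums. The ratings and the four-item test are hypothetical, invented purely for illustration.

```python
# Each row: one panelist's estimated probabilities that the borderline
# examinee answers each multiple-choice item correctly (hypothetical data).
ratings = [
    [0.6, 0.8, 0.4, 0.7],  # panelist 1
    [0.5, 0.9, 0.3, 0.7],  # panelist 2
    [0.7, 0.8, 0.5, 0.6],  # panelist 3
]

# A panelist's recommended raw cut score is the sum of his or her item ratings.
panelist_cuts = [sum(row) for row in ratings]

# The panel cut score is commonly taken as the mean of the recommendations.
cut_score = sum(panelist_cuts) / len(panelist_cuts)
print(round(cut_score, 2))  # mean of 2.5, 2.4, and 2.6 -> 2.5
```

In operational standard setting this computation is usually followed by discussion rounds, impact data, and possible revision of ratings, none of which the sketch attempts to model.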