TABLE OF CONTENTS
Statement of authorship
Table of contents
List of figures
List of tables
Acknowledgments
Abstract
CHAPTER 1 INTRODUCTION
1.1 INTRODUCTION TO TEST VALIDITY
1.2 THE STUDY
1.3 SIGNIFICANCE OF THE STUDY
CHAPTER 2 LITERATURE REVIEW
2.1 STUDIES ON VALIDITY DISCUSSION
2.1.1 The conception of validity in language testing and assessment
2.1.2 Using interpretative argument in examining validity in language testing and assessment
2.1.3 The argument-based validation approach in practice so far
2.1.4 English placement test (EPT) in language testing and assessment
2.1.5 Validation of an EPT
2.1.6 Testing and assessment of writing in a second language
2.2 GENERALIZABILITY THEORY (G-THEORY)
2.2.1 Generalizability and Multifaceted Measurement Error
2.2.2 Sources of variability in a one-facet design
2.3 SUMMARY
CHAPTER 3 METHODOLOGY
3.1 RESEARCH DESIGN
3.2 PARTICIPANTS
3.2.1 Test takers
3.2.2 Raters
3.3 MATERIALS
3.3.1 The English Placement Writing Test (EPT W) and the Task Types
3.3.2 Rating scales
3.4 PROCEDURES
3.4.1 Rater training
3.4.2 Rating
3.4.3 Data analysis
3.5 DATA ANALYSIS
3.5.1 To what extent is test score variance attributed to variability in the following: (a) task? (b) rater?
3.5.2 How many raters and tasks are needed to obtain the test score dependability of at least .85?
3.5.3 What are vocabulary distributions across proficiency levels of academic writing?
CHAPTER 4 RESULTS
4.1 RESULTS FOR RESEARCH QUESTION 1
4.2 RESULTS FOR RESEARCH QUESTION 2
4.3 RESULTS FOR RESEARCH QUESTION 3
CHAPTER 5 DISCUSSION AND CONCLUSIONS
5.1 GENERALIZATION INFERENCE
5.2 EXPLANATION INFERENCE
5.3 SUMMARY AND IMPLICATIONS OF THE STUDY
5.4 LIMITATIONS OF THE STUDY AND SUGGESTIONS FOR FUTURE RESEARCH
REFERENCES
APPENDIX A EPT W RATING SCALE
THESIS TOPIC ASSIGNMENT DECISION (Copy)
LIST OF FIGURES
2.1 An illustration of inferences in the interpretative argument
2.2 Bridges that represent inferences linking components in performance assessment (adapted from Kane et al., 1999)
2.3 Evidence to build the validity argument for the test
5.1 Generalization inference in the validity argument for the PT W with 2 assumptions and backing
5.2 The explanation inference in the validity argument for the PT test with 1 assumption and backing
LIST OF TABLES
2.1 Summary of the inferences, warrants in the TOEFL validity argument with their underlying assumptions (Chapelle et al., 2010, p. 7)
2.2 A framework of sub-skills in academic writing (McNamara, 1991)
3.1 Texts and word counts in the two levels of the EPT sub-corpora
4.1 Variance components attributed to test scores
4.3 Distribution of vocabulary across proficiency levels
ACKNOWLEDGMENTS

I would like to express my sincere appreciation to my supervisor, Dr. Vo Thanh Son Ca, who inspired me and gave me her devoted instruction throughout the period in which this project was conducted. From the onset, my research topic was quite broad, but with her support and guidance I learned how to combine theory and practice. Thanks to her instruction and her willingness to motivate and guide me with many questions and comments, I became deeply aware of the important role of research in language testing and assessment. More than that, I valued her dedicated and detailed support, including her quick feedback and comments on my drafts; she observed every step of my work thoroughly and helped me make significant improvements. Without my supervisor's dedicated support, this research would not have been completed. I would also like to take this chance to thank my family and friends, who always take care of, assist, and encourage me. I could not have completed my dissertation without the support of all these marvelous people.
ABSTRACT

Foreign or second language writing is one of the most important skills in language learning and teaching, and universities use scores from writing assessments to make decisions on placing students in language support courses. Therefore, for the inferences based on test scores to be valid, it is important to build a validity argument for the test. This study built a validity argument for the English Placement Writing test (EPT W) at BTEC International College Da Nang Campus. In particular, the study examined two inferences, generalization and explanation, by investigating the extent to which tasks and raters contributed to test score variability, how many raters and tasks need to be involved in the assessment process to obtain a test score dependability of at least .85, and the extent to which vocabulary distributions differed across proficiency levels of academic writing. To achieve these goals, the test score data from 21 students who took two writing tasks were analyzed using Generalizability theory. Decision studies (D-studies) were employed to investigate the number of tasks and raters needed to obtain a dependability of 0.85. The 42 written responses from the 21 students were then analyzed to examine the vocabulary distributions across proficiency levels. The results suggested that tasks were the main source of variance in test scores, whereas raters contributed to the score variance in a more limited way. To obtain a dependability of 0.85, the test should include 14 raters and 10 tasks, or 10 raters and 12 tasks. In terms of vocabulary distributions, low-level students produced less varied language than higher-level students; the findings suggest that higher-proficiency learners produce a wider range of word families than their lower-proficiency counterparts.
CHAPTER 1 INTRODUCTION

This chapter presents the introduction to test validity and the purpose of this thesis. The chapter concludes with the significance of this thesis.
1.1 INTRODUCTION TO TEST VALIDITY
Language tests are needed to measure students' English ability in college settings. Among the most common are entrance or placement tests, which are used to place students into appropriate language courses. The use of test scores therefore plays a very important role. The placement test at BTEC International College serves as the case for this research study and for building a validity argument, with further research purposes in mind. The test has a certain impact on students, administrators, and instructors at BTEC International College Da Nang Campus. First, the test score helps students know whether they are ready for collegiate courses taught in English. Second, the test score helps administrators of English programs place students into English language classes at the appropriate level. The information on students' ability would also help instructors with their instruction and lesson planning. Besides, students would come to value the importance of language ability for their success in college and pay more attention to improving their academic skills.
Given the important role of entrance tests, test validity is the focus of this study. Test validity is the extent to which a test accurately measures what it is supposed to measure, and validity refers to the interpretations of test scores entailed by proposed uses of tests, as supported by evidence and theory (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). In other words, validation is a process in which test developers and/or test users gather evidence to provide "a sound scientific basis" for interpreting test scores.
Validity researchers emphasize the quality, rather than the quantity, of evidence gathered to support validity interpretations. Supporting evidence can fall into one of four categories: evidence based on test content, evidence based on response processes, evidence based on relations to other variables, and evidence based on the consequences of testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999).
To provide these four categories of evidence, different kinds of research studies need to be conducted: the Achieve alignment method is used to provide evidence based on test content (Rothman, Slattery, Vranek, & Resnick, 2002); evidence based on response processes is provided with the help of the cognitive interview method (Willis, 2005; Miller, Cheep, Wilson, & Padilla, 2013); the predictive method is conducted to provide evidence based on relations to other variables; and evidence based on the consequences of testing can be backed up with argument-based approaches to a test's interpretative argument and validity argument (Kane, 2006).
1.2 THE STUDY
BTEC International College - FPT University administers its placement test (PT) every semester to incoming students to measure their English proficiency for university studies. The test covers four skills: reading, listening, speaking, and writing. Only the writing skill is the focus of this study.
This study developed a validity argument for the English Placement Writing test (EPT W) at BTEC International College - FPT University. Developed and first administered in Summer 2019, the EPT W is intended to measure test takers' writing skills necessary for success in academic contexts (see Table 1.1 for the structure of the EPT W). Building a validity argument for this test is therefore very important: it helps educators and researchers understand the consequences of assessment. In particular, the objectives of this study are to investigate: 1) the extent to which tasks and raters contribute to score variability; 2) how many tasks and raters need to be involved in the assessment to obtain a test score dependability of at least .85; and 3) the extent to which vocabulary distributions differ across proficiency levels of academic writing.
Table 1.1 The structure of the EPT W

Total test time: 30 minutes
Number of parts: 2

Part 1
Task content: Write a paragraph using one tense on any familiar topic. For example: Write a paragraph (100-120 words) to describe an event you attended recently.

Part 2
Task content: Write a paragraph using more than one tense on a topic that relates to publicity. For example: Write a paragraph (100-120 words) to describe a vacation trip from your childhood. Use these clues: Where did you go? When did you go? Who did you go with? What did you do? What is the most memorable thing? Etc.
The EPT W uses a rating rubric to assess test takers' performance. The appropriateness of a response is judged against a list of criteria, such as task achievement, grammatical range and accuracy, lexical resource, and coherence and cohesion (see Appendix A).
1.3 SIGNIFICANCE OF THE STUDY
The results of the study should contribute theoretically to the field of language assessment. By providing evidence to support inferences based on scores of the EPT W, this study contributes to the discussion of test validity in the context of academic writing.
Practically, the results should inform decisions about the number of tasks and raters used to assess writing ability. The findings should provide an understanding of how different components affect the variability of test scores and the kind of language elicited, which would offer guidance on choosing appropriate tasks for measuring academic writing.
CHAPTER 2 LITERATURE REVIEW

This chapter discusses previous studies on validity and introduces generalizability theory (G-theory), which was used as the background for the data analyses.
2.1 STUDIES ON VALIDITY DISCUSSION
2.1.1 The conception of validity in language testing and assessment
What is validity?
The definition of validity in language testing and assessment can be traced through three main time periods.
First, Messick (1989) stated that "validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores or other modes of assessment" (p. 13). Messick's view of validity was supported and found official recognition in AERA, APA, and NCME (1985), which describes validity as follows:

The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. A variety of inferences may be made from scores produced by a given test, and there are many ways of accumulating evidence to support any particular inference. Validity, however, is a unitary concept. (p. 9)
Second, the definition of validity was explained and elaborated by Bachman (1990). This definition helps confirm that the inferences made on the basis of test scores and their uses, rather than the tests themselves, are the objects of validation, in concert with Messick's view. According to Bachman, validity has a complex nature comprising a number of aspects, including content validity, construct validity, concurrent validity, and the consequences of test use. AERA et al. (1999) restated: "validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests" (p. 9).
Third, through an examination of all of these statements about validity (AERA et al., 1999; APA, 1985; Bachman, 1990; Messick, 1989), Kane (2001) revealed four important aspects of the current view. First, validity involves an evaluation of the overall plausibility of a proposed interpretation or use of test scores. Second, consistent with the general principles growing out of construct validity, the current definition of validity (AERA et al., 1999; Messick, 1989) incorporates the notion that the proposed interpretations will involve an extended analysis of inferences and assumptions, including both a rationale for the proposed interpretation and a consideration of possible competing interpretations. Third, the resulting evaluative judgment reflects the adequacy and appropriateness of the interpretation and the degree to which the interpretation is adequately supported by appropriate evidence. Fourth, validity is an integrated, or unified, evaluation of the interpretation; it is not simply a collection of techniques or tools.
Different aspects of validity
A number of aspects of validity have been examined in recognition of the complexity of validity and its importance in test evaluation (Bachman, 1990, 2004; Bachman & Palmer, 1996; Brown, 1996). Both Bachman (1990) and Brown (1996) agreed on three main aspects of validity: content relevance and content coverage (content validity), criterion relatedness (criterion validity), and meaningfulness of construct (construct validity). In addition, Brown (1996) suggested the examination of standard setting, or the appropriateness of a cut-point, as another important aspect of validity.
2.1.2 Using interpretative argument in examining validity in language testing and assessment
The argument-based validation approach in language testing and assessment views validity as an argument construed through an analysis of theoretical and empirical evidence rather than a collection of separate quantitative or qualitative pieces of evidence (Bachman, 1990; Chapelle, 1999; Chapelle, Enright, & Jamieson, 2008, 2010; Kane, 1992, 2001, 2002; Mislevy, 2003). One widely supported argument-based validation framework uses the concept of the interpretative argument (Kane, 1992, 2001, 2002). This approach is clearly defined by Kane (1992) as follows:

The argument-based approach to validation adopts the interpretative argument as the framework for collecting and presenting validity evidence and seeks to provide convincing evidence for its inferences and assumptions, especially its most questionable assumptions. (p. 527)
The inferences in the interpretative argument consist of domain description, evaluation, generalization, explanation, extrapolation, and utilization. They are represented by arrows that connect the grounds (the target domain and observations at the bottom), the intermediate claims or implications (the observed score, expected score, and construct in the middle), and the claim or conclusion (the target score and test use at the top). Figure 2.1 shows the inferences in the interpretative argument.
Figure 2.1 An illustration of inferences in the interpretative argument (adapted from Chapelle et al., 2008)
In an article about how to put the argument-based approach into practice, Kane (2002) summarized the common description of an interpretative argument agreed upon by various testing researchers (Crooks, Kane, & Cohen, 1996; Kane, 1992; Shepard, 1993): "an interpretative argument is known as a network of inferences and supporting assumptions leading from scores to conclusions and decisions" (Kane, 2002, p. 231).
Structure of an interpretative argument
Kane (1992) argued that multiple types of inferences connect observations and conclusions. The idea of multiple inferences in a chain of inferences and implications is consistent with Toulmin, Rieke, and Janik's (1984) observation:
in practice, of course, any argument is liable to become the starting point for a further argument; this second argument tends to become the starting point for a third argument, and so on. In this way, arguments become connected together in chains. (p. 73)
Kane et al. (1999) illustrated an interpretive argument that might underlie a performance assessment. It consists of six types of inferential bridges, which are crossed when an observation of performance on a test is interpreted as a sample of performance in a context beyond the test. Figure 2.2 shows the illustration of inferences in the interpretive argument.

Figure 2.2 Bridges that represent inferences linking components in performance assessment (adapted from Kane et al., 1999)
Similar to Chapelle et al. (2008), Kane et al.'s (1999) interpretive argument consists of seven parts, each of which is linked to the next by an inference. First, in the interpretive argument, domain description links performances in the target domain to observations of performance in the test domain, which has been identified based on the test purpose. The observation of test performance reveals relevant knowledge, skills, and abilities in situations representative of those in the target domain.

The second link, evaluation, is an inference from an observation of performance to a score, and is based on assumptions about the appropriateness and consistency of the scoring procedures and the conditions under which the performance is obtained. Kane et al. (1999) described these conditions as "the criteria first used to score the performance are appropriate and have been applied as intended and second the performance occurred under conditions compatible with the intended score interpretation" (p. 9). In language assessment, assumptions underlying the evaluation inference are investigated through research on raters, scoring rubrics, and scales (e.g., McNamara, 1996; Skehan, 1998), but in addition to these aspects of the scoring process, test administration conditions also affect evaluation.
The third link, generalization, runs from an observed score on a particular measure to a universe score: the score that might be obtained from performances on multiple tasks similar to those included in the assessment. This link is based on the assumptions of measurement theory. Generalization refers to the use of the observed score as an estimate of the score that would be expected of a test taker across parallel tasks and test forms.
The fourth link, explanation, is an inference from the expected score to a construct of academic language proficiency. The construct refers to what accounts for consistencies in individuals' performances and fits logically into a network of related constructs and theoretical rationales.
The fifth link, extrapolation, runs from the universe score to a target score, which is essentially an interpretation of what a test taker knows or can do based on the universe score. This link relies on the claims in an interpretative argument and the evidence supporting those claims. Extrapolation refers to the inference made when the test taker's expected score is interpreted as indicative of the performance and scores they would receive in the target domain.
The last link, the utilization inference, links the target score to test use, which includes decisions about admissions and course recommendations.
2.1.3 The argument-based validation approach in practice so far
Several recent validation studies in language testing and assessment have attempted to put the argument-based approach into practice, three of which are illustrated here (Chapelle et al., 2008; Chapelle, Jamieson, & Hegelheimer, 2003; Chapelle et al., 2010). The first, carried out by Chapelle et al. (2003), exemplified the use of the concept of test purpose (Shepard, 1993) to identify sources of validity evidence, and its validity argument was structured around the framework of test usefulness (Bachman & Palmer, 1996). The other two illustrate the application of the structure of an interpretative argument to guide the validation process and to build a validity argument for the tests under examination.
Building a validity argument for the TOEFL iBT
Chapelle et al. (2008) employed and systematically developed Kane's conceptualization of an interpretative argument in order to build a validity argument for the TOEFL iBT test. The whole project comprises detailed descriptions of the interpretative argument for the TOEFL, a collection of relevant theoretical and empirical evidence on different aspects of the test's validity, and the construction of the validity argument for the TOEFL. The main components of the interpretative argument and the validity argument are illustrated in Table 2.1 and Figure 2.3, respectively.
Table 2.1 Summary of the inferences, warrants in the TOEFL validity argument with their underlying assumptions (Chapelle et al., 2010, p. 7)

Inference: Domain description
Warrant licensing the inference: Observations of performance on the TOEFL reveal relevant knowledge, skills, and abilities in situations representative of those in the target domain of language use in the English-medium institutions of higher education.
Assumptions underlying the inference:
1. Critical English language skills, knowledge, and processes needed for study in English-medium colleges and universities can be identified.
2. Assessment tasks that require important skills and are representative of the academic domain can be simulated.

Inference: Evaluation
Warrant licensing the inference: Observations of performance on TOEFL tasks are evaluated to provide observed scores reflective of targeted language abilities.
Assumptions underlying the inference:
1. Rubrics for scoring responses are appropriate for providing evidence of targeted language abilities.
2. Task administration conditions are appropriate for providing evidence of targeted language abilities.
3. The statistical characteristics of items, measures, and test forms are appropriate for norm-referenced decisions.

Inference: Generalization
Warrant licensing the inference: Observed scores are estimates of expected scores over the relevant parallel versions of tasks and test forms and across raters.
Assumptions underlying the inference:
1. A sufficient number of tasks are included in the test to provide stable estimates of test takers' performances.
2. Configuration of tasks on measures is appropriate for intended interpretation.
3. Appropriate scaling and equating procedures for test scores are used.
4. Task and test specifications are well defined so that parallel tasks and test forms are created.

Inference: Explanation
Warrant licensing the inference: Expected scores are attributed to a construct of academic language proficiency.
Assumptions underlying the inference:
1. The linguistic knowledge, processes, and strategies required to successfully complete tasks vary across tasks in keeping with theoretical expectations.
2. Task difficulty is systematically influenced by task characteristics.
3. Performance on new test measures relates to performance on other test-based measures of language proficiency as expected theoretically.
4. The internal structure of the test scores is consistent with a theoretical view of language proficiency as a number of highly interrelated components.
5. Test performance varies according to the amount and quality of experience in learning English.

Inference: Extrapolation
Warrant licensing the inference: The construct of academic language proficiency as assessed by the TOEFL accounts for the quality of linguistic performance in English-medium institutions of higher education.
Assumption underlying the inference: Performance on the test is related to other criteria of language proficiency in the academic context.

Inference: Utilization
Warrant licensing the inference: Estimates of the quality of performance in the English-medium institutions of higher education obtained from the TOEFL are useful for making decisions about admissions and appropriate curricula for test takers.
Assumptions underlying the inference:
1. The meaning of test scores is clearly interpretable by admissions officers, test takers, and teachers.
2. The test will have a positive influence on how English is taught.
Based on the interpretative argument for the TOEFL, all the relevant evidence was collected and organized in order to build the validity argument for the test, as seen in Figure 2.3 below.
Figure 2.3 Evidence to build the validity argument for the test (adapted from Chapelle et al., 2010). The figure links each inference to the evidence backing it:

CONCLUSION: The test scores reflect the ability of the test taker to use and understand English as it is spoken and heard in college and university settings. The score is useful for aiding in admissions and placement decisions and for guiding English-language instruction.

Utilization:
1. Educational Testing Service has produced materials and held test user information sessions.
2. Educational Testing Service has produced materials and held information sessions to help test users set cut scores.
3. The first phases of a washback study have been completed.

Extrapolation:
Results indicate positive relationships between test performance and students' academic placement, test takers' self-assessments of their own language proficiency, and instructors' judgments of students' English language proficiency.

Explanation:
1. Examination of task completion processes and discourse supported the development of and justification for specific tasks.
2. Expected correlations were found among TOEFL measures and other tests.
3. Correlations were found among measures within the test, with the expected factor structure.
4. Results showed the expected relationship with English learning.

Generalization:
1. Results from reliability and generalizability studies indicated the number of tasks required.
2. A variety of task configurations was tried to find a stable configuration.
3. Various rating scenarios were examined to maximize efficiency.
4. An equating method was identified for the listening and the reading measures.
5. An ECD process yielded task shells for producing parallel tasks.

Evaluation:
1. Rubrics were developed, trialed, and revised based on expert consensus.
2. Multiple task administration conditions were developed, trialed, and revised based on expert consensus.
3. Statistical characteristics of tasks and measures were monitored throughout test development, and modifications in tasks and measures were made as needed.

Domain description:
1. Applied linguists identified academic domain tasks; research showed teachers and learners thought these tasks were important.
2. Applied linguists identified language abilities required for academic tasks.
3. A systematic process of task design and modeling was engaged in by experts.
2.1.4 English placement test (EPT) in language testing and assessment
What is EPT?
Placement tests are widely used within institutions, and their scope of use varies across situations (Brown, 1989; Douglas, 2003; Fulcher, 1997; Schmitz & delMas, 1991; Wall, Clapham & Alderson, 1994; Wesche et al., 1993). Regarding their purpose, Fulcher (1997) generalized that "the goal of placement testing is to reduce to an absolute minimum the number of students who may face problems or even fail their academic degrees because of poor language ability or study skills" (p. 1).

English as a Second/Foreign Language (ESL or EFL) placement testing is commonly conducted at the beginning of students' studies to determine which level of language study would be appropriate (Brown, 1989; Douglas, 2003), and it can be put into practice in a variety of ways. For example, the placement test at BTEC International College - FPT University (PT) can be used within a developmental college curriculum. Second, a PT can be used for placing students of varying language backgrounds and skill levels in an intensive ESL/EFL program (Wesche et al., 1993). In another case, a placement test can be developed to identify overseas students entering an English-medium university whose language skills or abilities are insufficient for their academic life (Douglas, 2003; Fulcher, 1997). In fact, besides using one of the major international tests such as the TOEFL or IELTS for admissions, many colleges and universities conduct further evaluation of students after their arrival on campus in order to get a more precise assessment of their specific English language abilities. The test results are used to decide whether the students need more English instruction and which ESL/EFL courses can be offered to meet their needs (Douglas, 2003).
2.1.5 Validation of an EPT
Several researchers have addressed the issue of validity in placement testing (Brown, 1989; Fulcher, 1997; Lee & Greene, 2007; Schmitz & delMas, 1991; Truman, 1992; Usaha, 1997; Wall, Clapham, & Alderson, 1994). Some major concerns and approaches in examining the validity of an EPT can be summarized here.
Fulcher (1997) investigated the validity of the language placement test at the University of Surrey, with the purpose of identifying students needing more English instruction to be successful in their academic life. The test was about one hour long and consisted of three parts: essay writing, structure of English (grammar), and reading comprehension. He elaborated on aspects of the test, including how to set cut-scores for placement, how to exploit more means of statistical analysis, how to develop parallel test forms, and how to use student questionnaires for face validity.
Another significant issue is the need to consider a number of relevant constraints when examining the validity of an EPT (Fulcher, 1997). These constraints comprise how much testing time is allowed, how many examiners are employed, and how much money and effort are spent on carrying out pretesting and post-hoc analyses or on equating test forms and formats (economic, logistic, and administrative constraints).
Schmitz and delMas (1991) identified the two most common inferences in the interpretation and use of placement test scores. The first is that scores accurately represent a student's standing within an academic domain or dimension of learning. The second is that a certain amount of mastery within that domain is required for the student to succeed in a college-level course or curriculum. These two inferences reflect the essential role of placement tests, which is to discriminate students who need to take remedial-level work from those who do not, or among those who need different levels of instruction.
Moreover, these two main inferences are elaborated into four possible underlying hypotheses that should be considered in validating placement tests (Schmitz & delMas, 1991). First, the test distinguishes between masters and non-masters within an academic domain of learning. Second, placement scores contribute to the prediction of course grades in sections for which student placement was unguided by test scores. Third, placing students according to placement test cut scores results in higher rates of course success (hit rates) than the rates achieved when placement scores are not used (base rates). Fourth, course success is related to other criteria representing desirable standards, for example, performance in subsequent courses and cumulative grade point average (GPA). The authors also give guidelines on how to examine different validity types of a placement test based on these four underlying hypotheses.
In brief, this review of validation studies of placement testing in general, and of EPTs in particular, is meaningful in several ways: it introduces a general understanding of why building a validity argument for a PT is needed and how validation affects many other aspects of the test.
2.1.6 Testing and assessment of writing in a second language
Writing in a second language
In various ESL exams, writing has been described as a cognitively challenging task. Raimes (1994) describes it as "a difficult, anxiety-filled activity" (p. 164). Lines (2014) elaborated: for any writing task, students need not only to draw on their knowledge of the topic, its purpose, and its audience, but also to make appropriate structural, presentational, and linguistic choices that shape meaning across the whole text.
It is even more complex for second language (L2) writers. Examining seventy-two studies comparing research on first language (L1) and L2 writing, Silva (1993) observed that writing in an L2 involves achieving specific rhetorical or aesthetic effects through the manipulation of sentences and vocabulary, and that it is linguistically different in important ways from L1 writing.

Testing and assessment of writing in a second language
McNamara (1991) gave a framework of sub-skills in academic writing which helps to explore the convergence and separability of the sub-skills of a writing construct model, including grammar and lexis, cohesion and coherence, and arrangement of ideas (see Table 2.2 below).

Table 2.2 A framework of sub-skills in academic writing (McNamara, 1991)

Criterion (sub-skill): Arrangement of Ideas and Examples (AIE)
Description and elements:
1. presentation of ideas, opinions, and information
2. aspects of accurate and effective paragraphing
3. elaborateness of details
4. use of different and complex ideas and efficient arrangement
5. keeping the focus on the main theme of the prompt
6. understanding the tone and genre of the prompt
7. demonstration of cultural competence

Criterion (sub-skill): Communicative Quality

Criterion (sub-skill): Sentence Structure and Vocabulary (SSV)
Description and elements:
1. using appropriate, topic-related and correct vocabulary (adjectives, nouns, verbs, prepositions, articles, etc.), idioms, expressions, and collocations
2. correct spelling, punctuation, and capitalization (the density and communicative effect of errors in spelling and in word formation (Shaw & Taylor, 2008, p. 44))
3. appropriate and correct syntax (accurate use of verb tenses and independent and subordinate clauses)
4. avoiding use of sentence fragments and fused sentences
5. appropriate and accurate use of synonyms and …
This current study focuses on nonnative English learners' academic writing for testing purposes. It relates to standardized high-stakes language testing rather than classroom assessment; testing in general strikes a deep emotional chord in people (Gebril & Plakans, 2015). As writing is a complex skill composed of grammar and lexis, cohesion and coherence, and arrangement of ideas (McNamara, 1991), this study uses test score analysis to check whether task design and raters have an impact on students' performance and whether students' use of language is successful in an academic context.
2.2 GENERALIZABILITY THEORY (G-THEORY)
What is Generalizability theory (G-theory)?

Generalizability (G) theory is a statistical theory about the dependability of behavioral measurements. Cronbach, Gleser, Nanda, and Rajaratnam (1972) sketched the notion of dependability as follows:

The score on which the decision is to be based is only one of many scores that might serve the same purpose. The decision maker is almost never interested in the response given to the particular stimulus objects or questions, to the particular tester, at the particular moment of testing. Some, at least, of these conditions of measurement could be altered without making the score any less acceptable to the decision maker. (p. 79)
The strength of G-theory is that multiple sources of error in a measurement can be estimated separately in a single analysis. G-theory enables the decision maker to determine how many occasions, test forms, and administrations are needed to obtain dependable scores. In the process, G-theory provides a summary coefficient reflecting the level of dependability: a generalizability coefficient that is analogous to classical test theory's reliability coefficient.
In language assessment, Bachman (1990) described the use of G-theory to estimate the errors associated with generalizing task scores across a set of language test tasks and test takers. Generalization is a way of conceptualizing reliability that has been used in numerous validation studies of language assessments; therefore, assumptions underlying the generalization inference can be supported through reliability estimates. Other studies have provided evidence for the generalization inference by studying test administration conditions (Kane et al., 1999) and score equating (Kane, 2004).
2.2.1 Generalizability and Multifaceted Measurement Error
From the perspective of G-theory, a measurement is a sample from a universe of admissible observations that a decision maker is willing to treat as interchangeable for the purposes of making a decision (Shavelson & Webb, 1991). A one-facet universe is defined by a single source of measurement error. If item is a facet of the measurement, the item universe is defined by all admissible items. If the decision maker is willing to generalize from performance on one occasion to performance on a much larger set of occasions, occasion is a facet, and the occasions universe is defined by all admissible occasions. Error is always present when a decision maker generalizes from a measurement to behavior in the universe.

A single item or test score is unlikely to be an accurate estimate of a person's ability, because several sources of error are at play: occasions, items, raters, etc. G-theory makes it possible to estimate the variance due to multiple sources: the variance due to the construct of interest and the variance due to each source of error can be estimated separately.
2.2.2 Sources of variability in a one-facet design
Definitions of some sources of variability

Person (the object of measurement): variability due to differences in ability on the construct of interest.

Item facet: the extent to which items vary in difficulty.

Person-by-item interaction: the extent to which items are differentially difficult for different persons (e.g., the background knowledge of test takers can affect their scores on a reading test).
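To illustrate how these one-facet variance components can be estimated, the sketch below applies the standard expected-mean-square equations for a crossed persons x items (p x i) design to a small invented score matrix. The numbers are hypothetical and serve only to show the computation.

```python
import numpy as np

# Hypothetical score matrix: rows = persons, columns = items (p x i design).
scores = np.array([
    [3, 4, 2, 5],
    [2, 3, 1, 4],
    [4, 5, 3, 5],
    [1, 2, 2, 3],
    [3, 3, 2, 4],
], dtype=float)

n_p, n_i = scores.shape
grand = scores.mean()
person_means = scores.mean(axis=1)
item_means = scores.mean(axis=0)

# Mean squares from a two-way ANOVA without replication.
ms_p = n_i * np.sum((person_means - grand) ** 2) / (n_p - 1)
ms_i = n_p * np.sum((item_means - grand) ** 2) / (n_i - 1)
resid = scores - person_means[:, None] - item_means[None, :] + grand
ms_pi = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))

# Expected-mean-square solutions for the variance components.
var_pi = ms_pi                            # p x i interaction confounded with error
var_i = max((ms_i - ms_pi) / n_p, 0.0)    # item (difficulty) variance
var_p = max((ms_p - ms_pi) / n_i, 0.0)    # person (universe-score) variance

print(f"persons: {var_p:.3f}, items: {var_i:.3f}, p x i/error: {var_pi:.3f}")
# Relative G coefficient for a k-item test: var_p / (var_p + var_pi / k)
k = n_i
print(f"G (relative, {k} items): {var_p / (var_p + var_pi / k):.3f}")
```

The same logic extends to the two-facet task x rater design used in this study, with one expected-mean-square equation per variance component.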
2.3 SUMMARY
Based on the above review of current validation studies in language testing and assessment, especially of EPTs in colleges and universities, I investigate the validity of the English Placement Writing test (EPT W) used at BTEC International College - Da Nang Campus, which is administered to newcomers whose first language is not English. Using the framework of the interpretative argument for the TOEFL iBT test developed by Chapelle et al. (2008), I propose an interpretative argument for the EPT W by focusing on the following inferences: generalization and explanation. To achieve those aims, this study sought to answer three research questions. The first two questions aimed to provide evidence underlying the inferences of evaluation and generalization. The third question, which involved an analysis of linguistic features from the hand-typed writing records of the 21 passed tests, backed up the evidence for the explanation inference.
1. To what extent is test score variance attributed to variability in the following: (a) task? (b) rater?
2. How many raters and tasks are needed to obtain a test score dependability of at least .85?
3. What are the vocabulary distributions across proficiency levels of academic writing?
CHAPTER 3 METHODOLOGY

This chapter first provides information about the research design of the study. It then presents the participants (test takers and raters), the materials, the data collection procedures, and the data analyses used to answer each of the research questions.
3.1 RESEARCH DESIGN
This study employed a descriptive design that involved collecting a set of data and using it in a parallel manner to provide a more thorough approach to answering the research questions. The qualitative data were the 21 typescripts of written exams by students who passed the entrance placement test (79 out of 100 test takers did not pass and were placed into English class Level 0). The quantitative data included 400 writing scores for two writing tasks from a total of 100 test takers (each task was scored by two raters).
First, the analyses of the 400 writing scores were used to answer the first two research questions:

1. To what extent is test score variance attributed to variability in the following:
(a) task?
(b) rater?

2. How many raters and tasks are needed to obtain a test score dependability of at least .85?
Second, the linguistic feature analyses of the 21 passing written tests were used to answer the third research question:

3. What are the vocabulary distributions across proficiency levels of academic writing?
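The thesis does not prescribe a specific tool for this question, but vocabulary distributions of this kind are commonly examined with lexical frequency profiling, classifying each word token in a script against frequency-band word lists (e.g., Nation's K1/K2 lists and the Academic Word List). The sketch below is a hypothetical Python illustration: the tiny band sets and sample responses are invented stand-ins for the real word-family lists and the 21 scripts.

```python
from collections import Counter
import re

# Hypothetical frequency bands. In a real analysis these would be loaded
# from established word-family lists (e.g., K1/K2 and the Academic Word List).
BANDS = {
    "K1": {"the", "go", "event", "describe", "recent", "trip"},
    "K2": {"vacation", "memorable"},
    "AWL": {"context", "academic", "assess"},
}

def band_profile(text: str) -> dict:
    """Return the proportion of word tokens falling into each frequency band."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        band = next((b for b, words in BANDS.items() if tok in words), "off-list")
        counts[band] += 1
    total = sum(counts.values())
    return {band: round(n / total, 2) for band, n in counts.items()}

# Compare an invented low-level response with an invented higher-level one.
low = "I go the trip. The trip go the event."
high = "The memorable vacation I describe happened in an academic context."
print("low: ", band_profile(low))
print("high:", band_profile(high))
```

In a real analysis, the inputs would be the 21 typed scripts grouped by placement level, and a wider spread across bands (and more off-list word families) would indicate the more varied vocabulary expected of higher-proficiency writers.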
3.2 PARTICIPANTS
There were two categories of participants: 100 test takers and two raters.