BRIEF REPORTS
TESOL Quarterly invites readers to submit short reports and updates on their work. These summaries may address any areas of interest to TQ readers.
Edited by ALI SHEHADEH
United Arab Emirates University
JOHN LEVIS
Iowa State University
Diagnosing the Support Needs of Second Language Writers: Does the Time Allowance Matter?
CATHIE ELDER
University of Melbourne
Carlton, Victoria, Australia
UTE KNOCH
University of Melbourne
Carlton, Victoria, Australia
RONGHUI ZHANG
Shenzhen Polytechnic Institute
Shenzhen, China
This study investigates the impact of changing the time allowance for the writing component of a diagnostic English language assessment administered on a voluntary basis to first-year undergraduates at two universities with large populations of immigrant and international students following their admission to the university. The test is diagnostic in the sense of identifying areas where students may have difficulties and therefore benefit from targeted English language intervention concurrently with their academic studies. A change in the time allocation for the writing component of this assessment (from 55 to 30 minutes) was introduced in 2006 for practical reasons. It was believed by those responsible for implementing the assessment that a reduced time frame would minimize the problems associated with scheduling the test and accordingly encourage faculties to adopt the assessment tool as a means of identifying their students' language learning needs. The current study aims to explore how the shorter time allowance would influence the validity, reliability, and overall fairness of an EAP writing assessment as a diagnostic tool.
The impetus for the study arose from anecdotal reports from test raters to the effect that, under the new time limits, students were either planning inadequately in preparation for the task or else failing to meet the word requirements. The absence of planning time was perceived to have a negative impact on the quality of students' written discourse. Concerns were also expressed that the limited nature of the writing sample made it difficult to provide an accurate and reliable assessment of students' ability to cope with the writing demands of the academic situation.
As discussed in Weigle (2002), the time allowed for test administration raises issues of authenticity, validity, reliability, and practicality. Most academic writing tasks in the real world are not generally performed under time limits, and academic essays usually require reflection, planning, and multiple revisions. A writing task within a reduced time frame, without access to dictionaries and source materials, will inevitably be inauthentic in the sense that it fails to replicate the conditions under which academic writing is normally performed. Moreover, unless a test task is designed expressly to measure the speed at which test takers can answer the question posed, rigid time limits potentially threaten the validity of score inferences about test takers' writing ability. The limited amount of writing produced under time pressure may also make it difficult for raters to accurately assess the writer's competence. On the other hand, institutional constraints on the resources available for any assessment are inevitable. A timed essay test is certainly easier and more economical to administer, and it can be argued that even a limited sample of writing elicited under less than optimal conditions may be better than no assessment at all as a means of flagging potential writing needs. Achieving a balance between what is optimal in terms of validity, authenticity, and reliability and what is institutionally feasible is clearly important in any test situation.
Research investigating the time variable in writing assessment has produced somewhat contradictory findings, perhaps because of the different tasks, participants, contexts, and methodologies involved and also the differing time allocations investigated. Some studies suggest that allowing more time results in improved writing performance (Biola, 1982; Crone, Wright, & Baron, 1993; Livingston, 1987; Powers & Fowles, 1996; Younkin, 1986), whereas others find that changing the time allowance makes no difference to performance as far as rater reliability and/or the rank ordering of students is concerned (Caudery, 1990; Hale, 1992; Kroll, 1990). Not all studies use independent ability measures (such as test scores from a different language test) or a counterbalanced design that controls for extraneous effects such as task difficulty and order of presentation (but see Powers & Fowles, 1996). Investigative methods also differ, with most studies looking only at mean score differences across tasks without considering the validity implications of any differences in the relative standing of learners when the time variable is manipulated (but see Hale, 1992). Moreover, most studies have focused on overall scores, based on holistic scoring or performance aggregates, rather than exploring whether the time condition has a variable impact on different dimensions of performance, such as fluency and accuracy (but see Caudery, 1990). It is particularly important to consider these different dimensions when one is dealing with assessment for diagnostic purposes, where the prime function of the test score is to provide feedback to teachers and learners about future learning needs. If changing the time allocation influences the nature of the information yielded about particular dimensions of writing ability, this result may have important validity implications as well as practical consequences.
THE STUDY
This study aims to establish whether altering the time conditions on an academic writing test has an effect on (a) the analytic and overall (average) scores raters assigned to students' writing performance and (b) the level of interrater reliability of the test. If scores differ according to time condition, this result would have implications for who is identified as needing language support, and if consistent rating is harder to achieve under one or another condition, then decisions made about individual candidates' ability cannot be relied on. Thirty students each completed two writing tasks aimed at diagnosing their language support needs. For one of these tasks they were given a maximum of 30 minutes of writing time, and for the other they were given 55 minutes. A fully counterbalanced design was chosen to control for task version and order effects.
RESEARCH QUESTIONS
The study investigated the following research questions:
1. Do students' scores on the various dimensions of writing ability differ between the long (55-minute) and short (30-minute) time conditions?
2. Are raters' judgments of these dimensions of writing ability equally reliable under each time condition?
METHOD
Context of the Research
The preliminary study reported in this article was conducted in the context of a diagnostic assessment administered in very similar forms at both the University of Melbourne and the University of Auckland. The assessment serves to identify the English language needs of undergraduate students following their admission to one or the other university and to guide them to the appropriate language support offered on campus. The Diagnostic English Language (Needs) Assessment, or DELA/DELNA (the name of the testing procedure differs at each university), is a general rather than discipline-specific measure of academic English. The writing subtest, which is the focus of this study, is described in more detail in the Instruments section. The data for the current study were collected at the University of Auckland and analysed at the University of Melbourne.
Participants
Test Takers
The participants in the study were 30 first-year undergraduate students at the University of Auckland, ranging in age from 20 to 39 years. The group comprised 19 females and 11 males. All participants were English as an additional language (EAL) students from a range of L1 backgrounds, broadly reflecting the diversity of the EAL student population at the University of Auckland. The majority (64%) were Chinese speakers, while other L1 backgrounds included French, Malay, Korean, German, and Hindi. The mean length of residence in New Zealand was 5.3 years.
Raters
Two experienced DELNA raters were recruited to rate the essays collected for the study. DELNA raters are regularly trained and monitored (see, e.g., Elder, Barkhuizen, Knoch, & von Randow, 2007; Elder, Knoch, Barkhuizen, & von Randow, 2005; Knoch, Read, & von Randow, 2007). Both raters had postgraduate qualifications in TESOL as well as rating experience in other assessment contexts (e.g., International English Language Testing System).
Instruments
Tasks
To achieve a counterbalanced design, two prompts were chosen for the study. The topics of the essays were as follows:

Version A: Every citizen has a duty to do some sort of voluntary work.
Version B: Should intellectually gifted children be given special assistance in schools?

The task required students to write an argument essay of approximately 300 words in response to these questions. To help students formulate the content of the essays, students were provided with a number of brief supporting or opposing statements, although they were asked not to include the exact wording of these statements in their essays.
To ascertain that the two prompts used were of similar difficulty, overall ratings were compared across the 60 essays. An independent samples t test showed no statistically significant difference in the challenge the two prompts presented to test takers, t(58) = 0.415, p = .680.
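For readers who wish to replicate this check, the comparison is a standard independent samples t test. The sketch below is our own illustration (the original analysis was run in SPSS); it uses Python with SciPy, and the score arrays are invented placeholders standing in for the 30 overall ratings per prompt.

```python
# A minimal sketch of the prompt-equivalence check, assuming one overall
# rating per essay. The scores below are invented placeholders, not the
# study's data (which comprised 30 essays per prompt).
from scipy import stats

version_a_scores = [4.0, 4.5, 3.5, 4.0, 5.0, 3.0, 4.5]  # hypothetical
version_b_scores = [4.5, 4.0, 3.5, 4.5, 4.0, 3.5, 4.0]  # hypothetical

# Independent samples t test (two-tailed), as reported for t(58) in the text.
t_stat, p_value = stats.ttest_ind(version_a_scores, version_b_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```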
Rating Scale
The rating scale used was an analytic rating scale with three rating categories (fluency, content, and form) rated on six band levels ranging from 1 to 6, where a score of 4 or less indicates a need for English language support. Raters were asked to produce ratings for each of the three categories. These ratings were also averaged to produce an overall score.
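The scoring logic just described is simple enough to express directly. The helper below is hypothetical (the function name and the decision to apply the cutoff to the averaged score are our assumptions); the three categories, the 1-6 bands, and the "4 or less" support threshold come from the scale description above.

```python
# Hypothetical helper reflecting the scale described above: three analytic
# categories (fluency, content, form) on bands 1-6, averaged into an
# overall score. We assume here that the "4 or less" support threshold
# applies to the averaged score.
def score_essay(fluency: float, content: float, form: float) -> dict:
    overall = (fluency + content + form) / 3
    return {
        "fluency": fluency,
        "content": content,
        "form": form,
        "overall": round(overall, 2),
        "needs_support": overall <= 4,
    }

print(score_essay(fluency=4, content=5, form=4))
# {'fluency': 4, 'content': 5, 'form': 4, 'overall': 4.33, 'needs_support': False}
```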
Procedures
Data Collection
To obtain an independent measure of the students' language ability, the students first completed a screening test comprising a vocabulary and speed-reading task (Elder & Von Randow, in press). Based on these scores, the students were divided into four groups of more or less equal ability. Then, to control for prompt and order effects, a fully counterbalanced design was used, as outlined in Table 1.

TABLE 1
Research Design

Group   N   Version   Time limit   Version   Time limit
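A minimal sketch of this allocation procedure follows. The round-robin grouping and the particular group-to-condition mapping are our assumptions (Table 1's cell values are not reproduced here), but they illustrate the fully counterbalanced logic of two prompt versions crossed with two time limits and two orders of presentation.

```python
# A sketch of the counterbalanced allocation described above. Students are
# ranked by screening score, dealt round-robin into four roughly
# ability-matched groups, and each group receives one of the four
# version x time-limit orderings. The exact mapping is our assumption.
CONDITIONS = [
    [("A", 30), ("B", 55)],  # Group 1: Version A short, then Version B long
    [("A", 55), ("B", 30)],  # Group 2: Version A long, then Version B short
    [("B", 30), ("A", 55)],  # Group 3: Version B short, then Version A long
    [("B", 55), ("A", 30)],  # Group 4: Version B long, then Version A short
]

def assign_groups(screening_scores: dict[str, float]) -> dict[str, list]:
    """Map each student ID to a (version, minutes) task sequence."""
    ranked = sorted(screening_scores, key=screening_scores.get, reverse=True)
    # Dealing students 1,2,3,4,1,2,... keeps the four groups' mean ability close.
    return {sid: CONDITIONS[i % 4] for i, sid in enumerate(ranked)}

allocation = assign_groups({"s01": 88.0, "s02": 75.5, "s03": 91.0, "s04": 62.0})
print(allocation["s03"])  # highest scorer -> Group 1 ordering: [('A', 30), ('B', 55)]
```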
The writing scripts were presented in random order to the raters, who were given no information about the condition under which the writing was produced, so as to eliminate the possibility of their taking the time allowance into account when assigning the scores. Raters have been found in other studies (e.g., McNamara & Lumley, 1997) to compensate candidates for task conditions that they feel may have disadvantaged them.
Data Analysis
The scores produced by the two raters were entered into SPSS (2006). T tests and correlational analyses were used to answer the two research questions.
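The same analyses can be reproduced outside SPSS. The sketch below shows SciPy equivalents of the two procedures used for Research Question 1, with invented paired rating arrays standing in for the study's 30 students.

```python
# Sketch of the two analysis types named above, using SciPy in place of
# SPSS. The paired arrays below are invented placeholders: one averaged
# rating per student under each time condition (n = 30 in the study).
from scipy import stats

short_condition = [4.0, 3.5, 4.5, 4.0, 5.0, 3.0]  # hypothetical, 30 minutes
long_condition = [4.5, 3.5, 4.5, 4.0, 5.5, 3.5]   # same students, 55 minutes

# Paired samples t test: do mean scores differ between time conditions?
t_stat, p = stats.ttest_rel(short_condition, long_condition)

# Spearman rho: is the rank ordering of candidates preserved across conditions?
rho, rho_p = stats.spearmanr(short_condition, long_condition)

print(f"paired t = {t_stat:.2f} (p = {p:.3f}); Spearman rho = {rho:.2f}")
```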
RESULTS
Research Question 1. Do students' scores on the various dimensions of writing ability differ between the long (55-minute) and short (30-minute) time conditions?
Two different types of analyses were used to explore variation in students' scores under the two time conditions. First, mean scores obtained under each condition were compared (see Table 2). The means for form and fluency were almost identical in each time condition, whereas for content, the long writing task elicited ratings almost half a band higher than those allocated to the short one. Although mean scores for each of the analytic criteria were consistently higher in the 55-minute condition, a paired samples t test (Table 2) showed that none of these mean differences was statistically significant.

TABLE 2
Paired Samples t Tests

Variable                  Mean (short)  SD (short)  Mean (long)  SD (long)  t      df  p
Average fluency rating    4.13          0.73        4.15         0.79       0.128  29  0.899
Average content rating    4.18          0.79        4.40         0.86       1.58   29  0.125
Average form rating       3.90          0.78        4.02         0.80       1.07   29  0.293
Average total rating      4.07          0.71        4.19         0.76       1.14   29  0.262

Note. SD = standard deviation.

Second, a Spearman rho correlation was used to ascertain whether the ranking of the candidates differed under the two time conditions. Table 3 presents the correlations for the fluency, content, and form scores under the two conditions, as well as a correlation for the averaged, overall score.
Although the correlations in Table 3 are all significant, they vary somewhat in strength. The average scores for writing produced under the short and long time conditions correlate more strongly than do the analytic scores assigned to particular writing features. The correlations are lowest for the fluency criterion, although a Fisher r-to-z transformation indicates that the size of this coefficient does not differ significantly from the others.
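The Fisher r-to-z comparison has a compact closed form. The sketch below implements the textbook version for two independent correlation coefficients (a simplification here, since correlations computed on the same candidates are not strictly independent), with hypothetical coefficients and n = 30 as in the study.

```python
# Sketch of a Fisher r-to-z test for the difference between two correlation
# coefficients, treating them as independent for simplicity. z = arctanh(r)
# is approximately normal with standard error 1/sqrt(n - 3).
import math
from scipy import stats

def fisher_z_test(r1: float, n1: int, r2: float, n2: int) -> tuple[float, float]:
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed
    return z, p

# Hypothetical coefficients: e.g., the fluency correlation vs. the overall one.
z, p = fisher_z_test(r1=0.60, n1=30, r2=0.80, n2=30)
print(f"z = {z:.2f}, p = {p:.3f}")
```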
Research Question 2. Are raters' judgments of writing ability equally reliable under each time allocation?
It was of further interest to determine whether there were any differences in the reliability of rater judgments under the two time conditions. Table 4 presents the correlations between the two raters under each time condition. Although the correlation coefficients for the short and long conditions were not significantly different from one another, Table 4 shows that correlations were consistently higher for the short time condition.
TABLE 3
Correlations of Scores Under Short and Long Condition

Note. All results significant at the 0.01 level (2-tailed).

TABLE 4
Rater Correlations

Note. All results significant at the 0.01 level (2-tailed).
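Interrater reliability of this kind is obtained by correlating the two raters' scores within each time condition. The article does not name the coefficient used, so the sketch below shows a Pearson correlation on invented scores as one plausible reading.

```python
# Sketch of an interrater reliability check: correlate rater 1's and
# rater 2's scores within one time condition. Scores are invented
# placeholders; the study's coefficient type is not specified.
from scipy import stats

rater1_short = [4.0, 3.5, 4.5, 5.0, 3.0, 4.0]
rater2_short = [4.0, 4.0, 4.5, 4.5, 3.0, 4.5]

r, p = stats.pearsonr(rater1_short, rater2_short)
print(f"Interrater r (short condition) = {r:.2f}, p = {p:.3f}")
```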
DISCUSSION
The current study's purpose was to determine both the validity and practical implications of reducing the time allocation for the DELA/DELNA writing test from 55 to 30 minutes. Mean score comparisons showed that students performed very similarly across the two task conditions. Although this result accords with those of writing researchers such as Kroll (1990), Caudery (1990), and Powers and Fowles (1996), it is somewhat at odds with Biola (1982), Crone et al. (1993), and Younkin (1986), who showed that students performed significantly better when more time was given for their writing. However, as already suggested in our review of the literature, the differences between these studies' findings may be partly a function of sample size.
Worthy of note in our study is the greater discrepancy in means for content between the long and short writing conditions. The fact that test takers scored marginally higher on this category under the 55-minute condition is unsurprising, given that it affords more time for test takers to generate ideas on the given topic. In general, however, the practical impact of the score differences observable from this study is likely to be negligible. One might argue that shortening the task will produce slightly depressed means for the undergraduate population as a whole, with the result that a marginally higher proportion of students receive a recommendation of "needs support." However, this effect is hardly of a magnitude that would create significant strain on institutional resources and is, in any case, potentially of benefit in terms of ensuring that a larger number of borderline students are flagged, thereby gaining access to language support classes.
More important is the question of whether the writing construct changes when the time allocation decreases, because this has implications for the validity of inferences drawn from test scores. The cross-test correlational statistics are not strong for any of the rating criteria, and this is particularly true for fluency, implying that opportunities to display coherence and other aspects of writing fluency may differ under the two time conditions. These construct differences have potential implications for EAP support teachers who may use DELA/DELNA writing profiles to determine how to focus their interventions. It cannot, however, be assumed that the writing produced in the short time condition is a less valid indicator of candidates' academic writing ability than writing produced within the long time frame.
As for interrater reliability, the findings of this study revealed (as in the Hale, 1992, study) that scoring consistency was acceptable and comparable across the two time conditions. In fact, the data reported here suggest that alignment between raters increases slightly in the short writing condition on each of the writing criteria. Because this finding is not statistically significant, it is not appropriate to speculate further about possible reasons for this outcome, but the issue is certainly worth exploring further with a larger data set. In the meantime, we can conclude that shortening the writing task presents no disadvantage as far as reliability of rating is concerned.
The issue investigated in this small-scale preliminary study certainly warrants further investigation, both with a larger sample and using methods not yet applied in research on the impact of timing on writing performance. Procedures such as think-aloud verbal reports and discourse analysis could be used to get a better sense of any construct differences resulting from the time variable than can be gleaned from a quantitative analysis. If writing produced under the 55-minute condition were found to show more of the known and researched characteristics of academic discourse than that produced within the 30-minute condition, this result would have important validity implications with regard to the diagnostic capacity of the procedure and its usefulness for students, teaching staff, and other stakeholders. A further issue, which is the subject of a subsequent investigation, is how test takers feel about doing the writing task under more stringent time conditions. Although we have shown that enforcing more stringent time conditions does not make a difference to test scores, it may be perceived as unfair, making it less likely that students will take their results seriously and act on the advice given. However, we would caution that any decision based on these results will, as is the case with any language testing endeavor, involve a trade-off between what is feasible and what is desirable in the context of concern.
ACKNOWLEDGMENTS
The authors thank Martin von Randow for assistance with aspects of the study design and Janet von Randow and Jeremy Dumble for their efforts in administering the test tasks and recruiting participants and raters for this study.
THE AUTHORS
Cathie Elder is director of the Language Testing Research Centre at the University of Melbourne, in Carlton, Victoria, Australia. Her major research efforts and output have been in the area of language assessment. She has a particular interest in issues of fairness and bias in language testing and in the challenges posed by the assessment of language proficiency for specific professional and academic purposes.
Ute Knoch is a research fellow at the Language Testing Research Centre, University of Melbourne, in Carlton, Victoria, Australia. Her research interests are in the areas of writing assessment, rating scale development, rater training, and assessing languages for specific purposes.
Ronghui Zhang is a lecturer in the Department of Foreign Languages at Shenzhen Polytechnic Institute, Shenzhen, China. Her research interests are in the areas of foreign language pedagogy and writing assessment.
REFERENCES
Biola, H. R. (1982). Time limits and topic assignments for essay tests. Research in the Teaching of English, 16, 97–98.
Caudery, T. (1990). The validity of timed essay tests in the assessment of writing skills. ELT Journal, 44, 122–131.
Crone, C., Wright, D., & Baron, P. (1993). Performance of examinees for whom English is their second language on the spring 1992 SAT II: Writing Test. Unpublished manuscript prepared for ETS, Princeton, NJ.
Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater responses to an online rater training program. Language Testing, 24, 37–64.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2, 175–196.
Elder, C., & Von Randow, J. (in press). Exploring the utility of a Web-based English language screening tool. Language Assessment Quarterly.
Ellis, R. (Ed.). (2005). Planning and task performance in a second language. Oxford: Oxford University Press.
Hale, G. (1992). Effects of amount of time allocated on the Test of Written English (Research Report No. 92-27). Princeton, NJ: Educational Testing Service.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12, 26–43.
Kroll, B. (1990). What does time buy? ESL student performance on home versus class compositions. In B. Kroll (Ed.), Second language writing: Research insights for the classroom. Cambridge: Cambridge University Press.
Livingston, S. A. (1987, April). The effects of time limits on the quality of student-written essays. Paper presented at the meeting of the American Educational Research Association, Washington, DC, United States.
McNamara, T., & Lumley, T. (1997). The effect of interlocutor and assessment mode variables in overseas assessments of speaking skills in occupational settings. Language Testing, 14, 140–156.
Powers, D. E., & Fowles, M. E. (1996). Effects of applying different time limits to a proposed GRE writing test. Journal of Educational Measurement, 33, 433–452.
SPSS, Inc. (2006). SPSS (Version 15) [Computer software]. Chicago: Author.
Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.
Younkin, W. F. (1986). Speededness as a source of test bias for non-native English speakers on the College Level Academic Skills Test. Dissertation Abstracts International, 47/11-A, 4072.
Effect of Repetition of Exposure and Proficiency Level in L2 Listening Tests
HIDEKI SAKAI
Shinshu University
Nagano, Japan
Second language (L2) listening test developers must take into account
a variety of factors such as the characteristics of the input, the task, and