Do ESL Essay Raters’ Evaluation Criteria Change With Experience? A Mixed-Methods, Cross-Sectional Study
KHALED BARKAOUI
York University
Toronto, Ontario, Canada
This study adopted a mixed-methods cross-sectional approach to identify similarities and differences in the English as a second language (ESL) essay holistic scores and evaluation criteria of raters with different levels of experience. Each of 31 novice and 29 experienced raters rated a sample of ESL essays holistically and analytically and provided written explanations for each holistic score they assigned. Score and qualitative data analyses were conducted to identify the criteria that the raters employed to rate the essays holistically. The findings indicated that both groups gave more importance to the communicative quality of the essays than to other aspects of writing. However, the novice raters tended to be more lenient and to give more importance to argumentation than the experienced raters did. The experienced raters tended to be more severe, to give more importance to linguistic accuracy, and to refer to evaluation criteria other than those listed in the rating scale more frequently than the novices did. The article concludes with a call for longitudinal research to investigate to what extent, how, and why rater evaluation criteria change over time and across contexts.
doi: 10.5054/tq.2010.214047
There is ample evidence that raters from different backgrounds may consider and weight evaluation criteria differently when assessing English as a second language (ESL) essays holistically. Holistic rating consists of reading a writing sample and choosing one score to reflect the overall quality of the paper (Goulden, 1994; Weigle, 2002). Holistic scales usually list evaluation criteria without specifying how important each criterion is to the overall score. As a result, raters may use personal judgment to determine the importance of different rating criteria and/or include evaluation criteria not listed in the rating scale when deciding on an overall score (Goulden, 1994). The result is large variability between (and within) raters in terms of rating criteria and scores. This variability is further exacerbated by variability in rater background and experience.
For example, Connor-Linton (1995) and Shi (2001) found that, although native (NES) and nonnative (NNES) speakers of English assigned similar scores to English as a foreign language (EFL) essays, they provided different reasons for assigning the same scores. Studies comparing raters from different academic and educational backgrounds also found that faculty from different departments rated and reacted differently to different aspects of ESL essays and disagreed as to when various criteria were being met (e.g., Mendelsohn & Cumming, 1987; Santos, 1988). Other studies (e.g., Brown, 1991; O’Laughlin, 1994) found no statistically significant differences in the holistic ratings assigned by ESL and English composition teachers to ESL essays, although the two groups gave different reasons for their holistic judgments of the same essays. Finally, Cumming, Kantor, and Powers (2002) found that ESL teachers tended to focus more on language issues than did the English teachers, who focused more on ideas or content.
Although these studies highlight the variability in evaluation criteria between raters from different backgrounds, little is known about variability within raters across time. Few studies have examined the relationship between teaching and rating experience and essay evaluation criteria. These studies adopted a cross-sectional approach, comparing the scores and/or evaluation criteria of raters who differed in terms of their experience teaching and assessing ESL writing. Most of these studies also used think-aloud protocols to compare the evaluation criteria and rating processes of novice and experienced raters (e.g., Cumming, 1990; Delaruelle, 1997; Erdosy, 2004; Sakyi, 2003; Weigle, 1999). Cumming (1990), for example, found that experienced teachers used a large and varied number of criteria and knowledge sources to read and judge ESL essays, whereas novice teachers tended to evaluate essays with only a few of these criteria, using skills that may derive from their general reading abilities or other knowledge they have acquired previously.
Although their primary objective was not the study of change in rater evaluation criteria as a function of experience, other studies have yielded results that are informative concerning this question. Lumley and McNamara (1995) reported evidence indicating that rater severity and self-consistency change over time, whereas Song and Caruso (1996) found that experienced teachers tended to be less severe in their holistic scoring of ESL essays than were less experienced raters. Sweedler-Brown (1985), by contrast, found that experienced raters tended to be more severe, and, as a result, she speculated that rating and teaching experience might give raters the confidence to rate more critically. A variety of factors, such as study context, writing tasks, evaluation criteria, and participants’ backgrounds, may explain the discrepancy between the findings of Sweedler-Brown and Song and Caruso.
Rinnert and Kobayashi (2001) found a significant interaction effect between rater first-language (L1) background and experience on rater evaluation criteria when assessing EFL essays holistically. Although novice NNES raters (i.e., inexperienced Japanese EFL students) attended mainly to content when judging and commenting on the essays, more experienced NNES raters (i.e., experienced Japanese EFL students and EFL teachers), like NES EFL teachers, attended more to clarity, logical connections, and organization. Rinnert and Kobayashi interpreted this finding as indicating a gradual change in NNES readers’ perceptions of EFL essays from preferring the writing features of their L1 to preferring many of the second-language (L2) writing features. Hamp-Lyons (1989) also observed a shift in NES raters’ evaluation criteria. As they gain more experience with other cultures and their languages, NES raters become used to different types of rhetorical patterns and transfer across languages and, consequently, tend to react less unfavorably to the English writing of members of those language communities.
None of the studies reviewed earlier, however, specifically examined the relationship between rater experience and evaluation criteria. The think-aloud studies described earlier focused on differences in the rating processes, particularly decision-making behaviors, between novice and experienced raters, rather than differences in their evaluation criteria per se. Methodologically, various approaches have been used to identify the evaluation criteria that raters employ when evaluating ESL essays holistically. The most common approach is to examine the correlations between the holistic scores raters assign and measures of specific essay features. These measures include analytic ratings of specific aspects of the essays (e.g., language, organization) by research participants (e.g., Tedick & Mathison, 1995), text analysis or coding of essay features by the researcher (e.g., Frase, Faletti, Ginther, & Grant, 1999; Homburg, 1984), and rewriting essays to reflect strengths and weaknesses in specific writing areas (e.g., Kobayashi & Rinnert, 1996; Mendelsohn & Cumming, 1987). Other studies have used self-report data in the form of interviews (e.g., Erdosy, 2004), questionnaires (e.g., Shi, 2001), written score explanations (e.g., Milanovic, Saville, & Shuhong, 1996; Rinnert & Kobayashi, 2001), and think-aloud protocols (e.g., Cumming et al., 2002; Delaruelle, 1997) to identify the evaluation criteria that raters employ when assessing ESL essays holistically.
THE PRESENT STUDY
The current study is part of a larger research project that compared the rating processes and outcomes of experienced and novice raters when using holistic and analytic rating scales to evaluate ESL essays (Barkaoui, 2008). The study described in this article combined both score analysis and self-report data to identify and compare the evaluation criteria that novice and experienced raters attend to when rating ESL essays holistically. Specifically, this study addressed two research questions:
1. What aspects of writing explain the holistic scores that raters assign to ESL essays?
2. To what extent and how do the aspects of writing that explain ESL essay holistic scores vary in relation to rater experience?
Method
Participants
The study included 31 novice and 29 experienced raters. Participants were assigned to groups based on their responses to a background questionnaire. Experienced raters were graduate students and/or ESL instructors who had been teaching and rating ESL writing for at least 5 years, had a master of arts or master of education degree, had received specific training in assessment and essay rating, and rated themselves as competent or expert raters. Novice raters were mainly students who were enrolled in or had just completed a preservice or teacher-training program in ESL, had no ESL teaching or rating experience at the time of data collection, and rated themselves as novices. The participants were recruited from various universities in southern Ontario, Canada. They varied in terms of their gender, age, and L1 backgrounds, but all were native or highly proficient nonnative speakers of English. Table 1 describes the profile of a typical participant in each group.
Essays
The study included 180 essays produced under real exam conditions by 180 adult ESL learners from diverse parts of the world and at different levels of proficiency in English.¹ Each essay was written within 30 minutes in response to one of two comparable independent prompts, one on the importance of the study of some academic subjects (study topic) and one on the advantages and disadvantages of practicing sports (sports topic).
¹ The 180 essays used in this study were obtained from the Test of English as a Foreign Language (TOEFL). The 180 test takers who wrote the essays came from more than 35 different countries and from about 30 different L1 backgrounds; the majority were Japanese, Spanish, and Korean speakers. Their ages ranged between 16 and 45 years (mean [M] = 25 years, standard deviation [SD] = 6). The great majority (91%) took the test to pursue graduate (58%) or undergraduate (33%) studies. Their TOEFL scores ranged between 90 and 290 (M = 212.56, SD = 44.51).
Data Collection Procedures
The 180 essays were first randomly compiled into batches of 24 essays, and then batches were randomly assigned to raters.² Each rater received a 30-min individual training session on using a holistic and an analytic rating scale and then rated their batch of essays holistically and analytically, with half the participants rating the essays holistically first and the other half rating them analytically first (i.e., a counterbalanced design). The holistic and analytic scales, borrowed from Hamp-Lyons (1991, pp. 247–251), included exactly the same evaluation criteria, wording, and number of score levels (9). The evaluation criteria in the analytic scale were grouped under five categories: communicative quality, argumentation, organization, linguistic accuracy, and linguistic appropriacy. In addition to scoring the essays, the participants were instructed to provide a brief written explanation for each holistic score they assigned. The essays were rated individually, at the rater’s home, and there was a span of at least 2 weeks between the holistic and analytic ratings. Each participant rated the same batch of 24 essays holistically and analytically but in a different random order of essays and prompts.
TABLE 1
Typical Profile of a Novice and an Experienced Rater

                                     Novice          Experienced
Role at time of the research         TESL student    ESL teacher
ESL teaching experience              None            10 years or more
Rating experience                    None            5 years or more
Postgraduate study                   None            MA or MEd
Received training in assessment      No              Yes
Self-assessment of rating ability    Novice          Competent or expert

Note. TESL = teaching English as a second language; MA = master of arts; MEd = master of education.
participants provided for each holistic score they assigned. All the raters were included in the score analyses, but one rater from each group was excluded from the analysis of the score explanation data because that rater did not provide explanations for more than one-third of the holistic scores assigned. The final sample consisted of 1,069 score explanations provided by 28 experienced and 30 novice raters. The novices provided 571 (53%) of the score explanations.
The score explanations were typed into word-processing files and then coded using the computer program NVivo (Richards, 1999). Given that these explanations were brief (24 words or less), the unit of analysis adopted was the whole score explanation provided. A coding scheme was developed based on the criteria in the rating scales, preliminary inspections of the data, and Cumming et al.’s (2002) empirically based schemes of rater decision-making behaviors and aspects of writing to which raters attend. The scheme consisted of 24 codes under seven main categories: five related to the categories in the analytic rating scale, one concerned comments on the overall quality of the essay (e.g., “poor essay”), and one related to comments on aspects of writing other than those included in the holistic scale (e.g., task completion, quantity). A complete list of the codes with examples from the current study appears in the appendix.
Each score explanation was coded in terms of focus (i.e., one of the seven coding categories in the appendix) and type (i.e., positive, negative, or neutral) as follows. Each score explanation was first classified as being related to one or more of the seven main categories in the appendix. The score explanation was then further categorized in terms of one or more of the subcategories under each category. For example, a comment on argumentation could be classified as being related to relevance, interest, support, and/or other argumentation aspects. When tallying the number of comments under each category and subcategory, each comment was counted only once for that category or subcategory. For instance, a comment that was coded as being related to interest and relevance, both under argumentation, was counted once under each of these two subcategories but also only once under the main category of argumentation. Finally, each comment was coded as being positive, negative, or neutral. Comments were coded as neutral if they were neither positive nor negative and/or were ambiguous (e.g., “Similar to the previous essay”; “Language fits description of # 5”; “Common ESL mistakes”; “Long essay”). Many score explanations included both negative and positive comments; such comments were coded as both negative and positive (i.e., twice). For example, the score explanation “Well structured, but should be longer” was coded as being positive for text organization (under organization) and negative in relation to quantity (under other aspects of writing).
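The counting rule described above can be illustrated with a minimal Python sketch. The data layout (each score explanation represented as a list of category and subcategory pairs) and the function name `tally_codes` are hypothetical and chosen only for illustration; the study itself coded the data in NVivo.

```python
from collections import Counter

def tally_codes(explanations):
    """Count each score explanation at most once per category and per subcategory.

    `explanations` is a list in which every item is the list of
    (category, subcategory) codes assigned to one score explanation.
    """
    category_counts = Counter()
    subcategory_counts = Counter()
    for codes in explanations:
        # Sets drop repeats, so one explanation adds at most 1 to any key.
        category_counts.update({category for category, _ in codes})
        subcategory_counts.update(set(codes))
    return category_counts, subcategory_counts

# One explanation coded for both interest and relevance under argumentation.
example = [[("argumentation", "interest"), ("argumentation", "relevance")]]
categories, subcategories = tally_codes(example)
```

In this example the explanation contributes one count to each of the two subcategories but only one count to the main category of argumentation, which mirrors the rule stated in the text.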
The author coded all the data in this study by assigning each score explanation all relevant codes in the coding scheme. To check the reliability of the coding, the coding scheme was discussed with another researcher, who then independently coded a random sample of 250 written score explanations (1,127 codes). Percentage agreement achieved was 90%, computed for agreement in terms of the main categories: organization, argumentation, linguistic accuracy, linguistic appropriacy, overall impression, and other aspects of writing (see appendix). Percentage agreements for main categories and within each category varied, however (e.g., 85% for argumentation; 93% for linguistic accuracy). All the difficult cases were discussed and, for most cases, the codes were reconciled. In the few cases where an agreement was not reached, the author selected the final code to be assigned.
Because the focus in this study was on comparing the frequency of focus and type of comments across rater groups, the coded data were tallied and percentages were computed for each rater for each code in the coding scheme. These percentages served as the data for comparison across rater groups. Statistical tests were then conducted on the main categories in the appendix. Subcategories were used for descriptive purposes only and to explain significant differences in main categories. Because the coded data did not seem to meet the statistical assumptions of parametric tests, nonparametric tests (Mann-Whitney test) were used to compare coded data across rater groups. To examine the relationships between the score explanations and the holistic scores that the raters provided, Spearman rho correlations were conducted. Because nonparametric tests rely on ranks, the following descriptive statistics are reported below: mean, median, standard deviation, and range (i.e., the difference between the highest and lowest values; Field, 2005).
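As a rough illustration of the group comparison, the sketch below computes the Mann-Whitney U statistic by counting, over all novice and experienced pairs, how often one group's value exceeds the other's (ties count as one half). The per-rater percentages are invented for the example, and the function name is an assumption; the sketch returns only the statistic and does not compute a p value or reproduce the software the author used.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent samples.

    U_a counts the pairs in which a value from group_a exceeds a value
    from group_b, with tied pairs counted as half a pair.
    """
    u_a = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group_a
        for b in group_b
    )
    return u_a, len(group_a) * len(group_b) - u_a

# Hypothetical per-rater percentages of comments falling in one category.
novice = [30.0, 25.0, 28.0]
experienced = [20.0, 22.0, 25.0]
u_novice, u_experienced = mann_whitney_u(novice, experienced)
```

A U value close to the number of pairs (here 3 × 3 = 9) for one group indicates that its values tend to rank above those of the other group.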
To examine the relationships between the analytic and holistic scores and the effects of rater experience on these relationships, multilevel modeling (MLM) was used. MLM is an advanced form of multiple-regression analysis that takes into account the hierarchical structure of data (Hox, 2002; Luke, 2004). Hierarchical data means that observations at lower levels are nested within units at higher levels. In this study, ratings are nested within raters. With nested data, there may be more variation between raters than within raters, a violation of the independence of observations assumption that underlies traditional multiple-regression analysis. MLM addresses this problem, because it assumes independence of observations between raters, but not between ratings within a rater (Hox, 2002; Luke, 2004). MLM also allows the examination of the effects of rater variables (e.g., experience) on holistic scores (main effects) and on the relationships between the analytic and holistic scores (called cross-level interaction effects in MLM; Hox, 2002).
The software program HLM 6.0 for Windows (Raudenbush, Bryk, Cheong, & Congdon, 2004) was used to build and test various MLM models, following procedures suggested by Hox (2002), before identifying the final model that fit the data. In addition to the outcome variable, holistic scores, the study included one rater-level (called Level-2 in MLM) predictor, rater experience (coded 0 for novice and 1 for experienced), and seven measures of essay features that constitute the Level-1 predictors. These measures were the five categories in the analytic scale as well as essay length (number of words per essay measured using the word count function in Microsoft Word) and essay topic. The prompt was used as a measure of essay topic (what the essay is about), with the study prompt coded 0 and the sports prompt coded 1.
FINDINGS
Score Analyses
Table 2 reports descriptive statistics and correlations between the holistic and analytic ratings by rater group. It shows that the novice group had slightly higher means and standard deviations than the experienced group did for the holistic scale and each of the analytic scales. In addition, the correlations between the holistic ratings, on the one hand, and communicative quality, organization, and argumentation, on the other hand, were slightly higher for the novice raters, but those correlations between holistic ratings and linguistic accuracy, linguistic appropriacy, and essay length were higher for the experienced group. The following paragraphs examine whether these differences are statistically significant.

TABLE 2
Descriptive Statistics and Pearson r Correlations by Rater Group

            Novice                  Experienced             Total
            M        SD      r      M        SD      r      M        SD      r
Holistic    5.48     1.57    1.00   5.08     1.41    1.00   5.29     1.51    1.00
CQ          5.69     1.61    0.64   5.48     1.34    0.62   5.59     1.49    0.63
ORG         5.77     1.67    0.61   5.33     1.35    0.59   5.56     1.54    0.61
ARG         5.57     1.75    0.63   5.16     1.46    0.56   5.38     1.62    0.61
LAC         5.36     1.52    0.56   5.04     1.30    0.61   5.21     1.43    0.58
LAP         5.54     1.55    0.55   5.29     1.34    0.58   5.42     1.45    0.57
Length      239.61   85.35   0.24   240.05   84.38   0.27   239.83   84.85   0.25

Note. All rating criteria are measured on a nine-point scale. M = mean; SD = standard deviation; CQ = communicative quality; ORG = organization; ARG = argumentation; LAC = linguistic accuracy; LAP = linguistic appropriacy; Length = number of words per essay.

MLM was used to (a) examine whether the two rater groups differed significantly in the holistic scores they assigned, (b) identify which essay features account for differences in the holistic scores the raters assigned, and (c) assess whether novice and experienced raters gave different weights to different evaluation criteria in the holistic scores they assigned. Five MLM models were examined before building the final model. First, these exploratory models indicated that essay length, communicative quality, argumentation, and linguistic accuracy had significant associations with the holistic scores, whereas topic, organization, and linguistic appropriacy did not, at p < 0.05. Second, the within-rater relationships between the holistic scores, on the one hand, and each of topic, essay length, communicative quality, and linguistic accuracy, on the other, varied significantly across raters. Third, on average, the experienced raters assigned significantly lower holistic scores than the novice raters did to essays on the same topic after statistically accounting for differences in terms of essay length, communicative quality, argumentation, and linguistic accuracy. Fourth, rater experience significantly moderated the relationships between the holistic scores, on the one hand, and the argumentation and linguistic accuracy scores, on the other.
Based on the results of the exploratory models and analyses, a final model was specified that included five measures of essay features: essay length, topic, communicative quality, argumentation, and linguistic accuracy. The exploratory models indicated that (a) these five predictors had significant associations with the holistic scores, (b) their associations with the holistic scores varied significantly across raters, and/or (c) their associations with the holistic scores were significantly influenced by rater experience. Organization and linguistic appropriacy did not meet any of the three criteria and, as a result, were not included in the final model. The results for the final MLM model are presented in Table 3.
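For readers less familiar with MLM notation, a two-level model of the kind described here might be written as follows. This is a sketch under stated assumptions, not the author's specification: the symbols ($H_{ij}$, $\gamma$, $u$, $r$) follow common MLM conventions rather than the article, the predictors are shown uncentered, and attaching a random effect to every slope is an assumption, because the article does not give the equations of its final model. Rater experience ($\mathrm{Exp}_j$, coded 0 or 1) enters the intercept and the argumentation and linguistic accuracy slopes, as the text reports.

```latex
% Level 1: holistic score H given by rater j to essay i
% Level 2: rater experience predicts the intercept and the ARG and LAC slopes
\begin{align*}
H_{ij} &= \beta_{0j} + \beta_{1j}\,\mathrm{Topic}_{ij} + \beta_{2j}\,\mathrm{Length}_{ij}
        + \beta_{3j}\,\mathrm{CQ}_{ij} + \beta_{4j}\,\mathrm{ARG}_{ij}
        + \beta_{5j}\,\mathrm{LAC}_{ij} + r_{ij} \\
\beta_{0j} &= \gamma_{00} + \gamma_{01}\,\mathrm{Exp}_j + u_{0j} \\
\beta_{4j} &= \gamma_{40} + \gamma_{41}\,\mathrm{Exp}_j + u_{4j} \\
\beta_{5j} &= \gamma_{50} + \gamma_{51}\,\mathrm{Exp}_j + u_{5j} \\
\beta_{pj} &= \gamma_{p0} + u_{pj}, \qquad p \in \{1, 2, 3\}
\end{align*}
```

In this notation, $\gamma_{01}$ is the main effect of experience on holistic scores, and $\gamma_{41}$ and $\gamma_{51}$ are the cross-level interaction effects.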
Table 3 reports various statistics. The first set of statistics is the Level-1 fixed effects, which refer to (a) the average intercept and (b) the average associations between each of the Level-1 predictors and the holistic scores. First, the intercept represents the average holistic score assigned by the novice raters to essays on the study prompt and adjusted for all four essay features in the final model. The value of the intercept is 5.53 and can be thought of as a baseline against which all other values in Table 3 are interpreted. Second, Table 3 shows that the average associations between essay length, communicative quality, argumentation, and linguistic accuracy, on the one hand, and holistic scores, on the other, are significant. For instance, the association for communicative quality (0.28) is positive and significant at p < 0.01, indicating that, on average, essays with high scores on this criterion obtained higher holistic scores (0.28 points higher), after accounting for the effects of all other predictors in the final model. By contrast, the association is −0.09 for topic, indicating that, on average, the sports topic (coded 1) resulted in lower scores than the study topic (coded 0), but this difference was not statistically significant.
The average association of essay length with the holistic scores was 0.003, indicating that, on average, the holistic scores increased by 0.003 points with each additional word (i.e., 0.30 points per 100 words). Note that the coefficients in column 2 are the unstandardized coefficients of the associations. Because the predictors were measured on different scales (e.g., essay length was measured in terms of number of words, whereas communicative quality was measured on a nine-point scale), these coefficients needed to be standardized to allow comparison of the strength of associations across predictors. The standardized coefficients appear in column 3 of Table 3. When the coefficients of associations are standardized, communicative quality and argumentation have the highest average associations with holistic scores (0.27 and 0.24, respectively), followed by essay length (0.17). The association between linguistic accuracy and holistic scores is the lowest (0.10). In other words, on average, communicative quality and argumentation played the most prominent roles in the holistic scores the raters assigned, followed by essay length and linguistic accuracy.
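The article does not state how the coefficients were standardized. One conventional approach, assumed here purely for illustration, multiplies the unstandardized coefficient by the ratio of the predictor's standard deviation to the outcome's standard deviation. The sketch below applies that formula to the unstandardized coefficients reported in this article and the total-sample standard deviations in Table 2; the function and variable names are hypothetical.

```python
def standardized_coefficient(b, sd_predictor, sd_outcome):
    """Rescale an unstandardized slope into standard-deviation units."""
    return b * sd_predictor / sd_outcome

# Total-sample SD of the holistic scores (Table 2).
SD_HOLISTIC = 1.51

# (unstandardized coefficient from the text, total-sample SD from Table 2)
predictors = {
    "essay length": (0.003, 84.85),
    "communicative quality": (0.28, 1.49),
    "argumentation": (0.22, 1.62),
    "linguistic accuracy": (0.11, 1.43),
}

standardized = {
    name: round(standardized_coefficient(b, sd, SD_HOLISTIC), 2)
    for name, (b, sd) in predictors.items()
}

# The length coefficient expressed per 100 additional words.
points_per_100_words = round(0.003 * 100, 2)
```

Under this assumed formula the results for essay length, argumentation, and linguistic accuracy match the standardized values reported in the text (0.17, 0.24, and 0.10). Communicative quality comes out at 0.28 rather than the reported 0.27, a gap that could plausibly reflect rounding of the published inputs.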
TABLE 3
Results for Final MLM Model
[Table body not recoverable from the source. Column headings: Fixed effects; Unstandardized coefficient (SE); Standardized coefficient. First row group: Level 1.]

The second set of statistics in Table 3 is the Level-2 fixed effects and concerns the direct effects of rater experience on (a) the holistic scores and (b) the associations between the Level-1 predictors (i.e., essay features) and the holistic scores. First, the coefficient for the intercept indicates the main effect of rater experience on essay scores. Table 3 shows that rater experience had a significant main effect on the intercept, such that experienced raters (coded 1), on average, assigned scores that were 0.40 points lower than those assigned by the novice raters (coded 0), after accounting for the effects of all predictors in the final model. Thus, although the novices, as a group, assigned an average score of 5.53 to essays on the study topic (as noted earlier), the experienced raters assigned an average score of 5.53 − 0.40 = 5.13 to the same essays. Second, the coefficients for the argumentation and linguistic accuracy slopes indicate the effects of rater experience on the relationships between each of these two predictors and the holistic scores. Thus, although the average (across all raters) unstandardized association between argumentation and holistic scores was 0.22, this association was significantly weaker for the experienced raters (0.22 − 0.10 = 0.12). By contrast, the average (unstandardized) association between linguistic accuracy and holistic scores was 0.11, but this association was significantly stronger for the experienced raters (0.11 + 0.20 = 0.31). As indicated in Table 2, (a) the correlation between the argumentation and holistic scores was higher for the novice raters (r = 0.63) than for the experienced raters (r = 0.56), whereas (b) the correlation between the linguistic accuracy and holistic scores was higher for the experienced group (r = 0.61) than for the novices (r = 0.56). The results of MLM analyses indicated that, after accounting for the main effects of essay features and rater experience on the holistic scores, the two groups differed significantly in terms of the magnitude of the relationship between their holistic scores, on the one hand, and the argumentation and linguistic accuracy scores they assigned, on the other.
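The arithmetic behind these group-specific values can be sketched in a few lines of Python. The function name is hypothetical; the coefficients are the ones quoted in the text, and the 0/1 coding of rater experience follows the study.

```python
def value_for_group(baseline, experience_effect, experienced):
    """Combine a baseline coefficient with the rater-experience effect.

    `experienced` is 0 for novice raters and 1 for experienced raters,
    matching the coding of rater experience in the study.
    """
    return baseline + experience_effect * experienced

# Intercept: average holistic score on the study topic.
intercept_novice = round(value_for_group(5.53, -0.40, 0), 2)
intercept_experienced = round(value_for_group(5.53, -0.40, 1), 2)

# Slopes: association of each analytic score with the holistic score.
arg_slope_experienced = round(value_for_group(0.22, -0.10, 1), 2)
lac_slope_experienced = round(value_for_group(0.11, 0.20, 1), 2)
```

Setting `experienced` to 0 returns the baseline unchanged, which is why the intercept of 5.53 describes the novice group.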
The third set of coefficients in Table 3 is the random effects, which indicate the extent to which the association between each of the Level-1 predictors, on the one hand, and the holistic scores, on the other, varied across raters. First, the between-rater variance indicates that the average holistic score (for 24 essays) for individual raters varied significantly across raters (minimum rater average = 3.79, maximum rater average = 7.54). The association between topic, length, communicative quality, and linguistic accuracy, on the one hand, and the holistic scores, on the other, for individual raters, also varied significantly across raters. For example, although the association between communicative quality and holistic scores was, on average, 0.28 (as noted earlier), the strength (and sign) of this association varied significantly across raters. Thus, although for some raters this association was negative and/or low (e.g., −0.21), for others it was positive and/or high (e.g., 0.65). Additionally, examination of the correlations between the holistic scores, on the one hand, and each of the features just mentioned, on the other, indicated the following:
1. The association between essay length and holistic scores was positive for all raters, indicating that, overall, longer essays obtained higher holistic scores from all raters.
2. Once rater experience is taken into account, there was no significant variability among raters in terms of the association between the argumentation and holistic scores, suggesting that rater experience is the main factor that accounts for differences across raters in terms of this association.
3. The associations between topic, communicative quality, and linguistic accuracy, on the one hand, and holistic scores, on the other, were positive for some raters and negative for others.
4. Although the relationships between holistic scores, on the one hand, and topic, essay length, and communicative quality, on the other hand, varied significantly across raters, rater experience did not seem to affect these relationships significantly. Other rater factors (e.g., L1, writing experience) might account for these differences between raters.
5. Within-rater variance indicates that there is still some variance between ratings assigned by the same raters, which is not accounted for by the factors in the model (i.e., topic, essay length, communicative quality, argumentation, and linguistic accuracy). Other essay characteristics (e.g., essay order, content) may account for differences between scores assigned by the same rater to different essays.
Qualitative Data
Table 4 reports (a) descriptive statistics for the aspects of writing reported in the written score explanations by focus and type of main category³ and (b) Spearman rho correlations between each of these categories and essay length and holistic scores. Table 4 shows that, in explaining the holistic scores they assigned, the participants made comments related mainly to argumentation (M = 26%), linguistic accuracy (M = 24%), communicative quality (M = 18%), and organization (M = 15%). Linguistic appropriacy was rarely used to explain the holistic scores assigned (M = 4%). In addition, the raters explained the scores they assigned with reference to their overall impressions of the quality of the essays (M = 4%) and other aspects of writing than those included in the rating scale, such as task completion and quantity (M = 6%).
³ All statistics reported in this section are based on percentages, rather than frequencies, of mention, as explained earlier.
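Spearman rho, the correlation reported in Table 4, is the Pearson correlation computed on ranks, with tied values sharing the mean of their rank positions. The following standard-library sketch illustrates the calculation on invented values; the function names and the data are assumptions, not the study's data or software.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman_rho(x, y):
    """Pearson correlation of the tie-averaged ranks of x and y."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = sum((a - mean_x) ** 2 for a in rank_x)
    spread_y = sum((b - mean_y) ** 2 for b in rank_y)
    return covariance / sqrt(spread_x * spread_y)

# Hypothetical data: percentage of positive comments and holistic score.
percent_positive = [10, 20, 30, 40]
holistic_scores = [3, 5, 4, 7]
rho = spearman_rho(percent_positive, holistic_scores)
```

Because only the rank order matters, the result is unaffected by any monotonic rescaling of the percentages or the scores.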
In terms of type of comments, Table 4 shows that the majority of the comments (M = 56%) were negative, indicating that (a) the essays had numerous problems, (b) raters tended to comment more frequently on weak aspects, and/or (c) it was easier for raters to perceive and/or comment on weak aspects of writing than on positive aspects. Only one-third of the comments were positive (M = 34%), whereas 7% were coded as neutral (i.e., neither positive nor negative, and/or ambiguous). The following paragraphs focus mainly on the positive and negative comments; the neutral comments are discussed very briefly, because (a) their meaning is not always clear and (b) the proportion of such comments is small compared with negative and positive comments.
Overall, the largest proportions of the positive comments the raters
TABLE 4
Descriptive Statistics and (Spearman rho) Correlations for Percentages of Writing Aspects Reported in Score Explanations for All Raters (N = 58 raters)

                 Descriptive statistics               Correlations with
                 M       Mdn     Range    SD          Holistic scores   Essay length
Focus
  CQ             17.60   16.67   44.44    8.82        −0.20**           −0.01
  ORG            15.42   14.58   43.75    8.65         0.00              0.07*
  ARG            26.29   25.09   34.65    7.76         0.04             −0.02
  LAC            23.76   23.89   38.19    8.12         0.11**            0.03
  LAP             4.15    2.80   22.30    4.66         0.07*             0.05
  Overall         3.80    1.70   20.29    5.27         0.07*             0.05
  Other           6.48    6.07   19.31    4.23        −0.00             −0.15**
Type
  Positive       34.21   32.76   65.62   13.88         0.55**            0.20**
  Negative       55.76   58.44   87.50   16.43        −0.60**           −0.22**
  Neutral         7.52    2.35  100.00   17.50         0.10**           −0.01
Type by focus
 Positive
  CQ             14.45   13.61   48.61   10.31         0.26**            0.13**
  ORG            17.89   17.36   71.53   13.08         0.17**            0.08**
  ARG            18.80   17.88   52.78   12.79         0.34**            0.15**
  LAC            11.90    7.64   50.35   11.87         0.37**            0.07*
  LAP             1.59    0.00   11.81    2.68         0.15**            0.06*
  Overall         3.30    1.24   19.44    4.54         0.22**            0.13**
  Other           2.46    0.00   31.60    4.88         0.09**            0.05
 Negative
  CQ             11.14   10.25   29.17    6.66        −0.37**           −0.06
  ORG             7.94    7.29   40.97    7.17        −0.26**            0.00
  ARG            22.54   22.71   50.14   11.15        −0.24**           −0.14**
  LAC            25.77   27.13   63.89   13.12        −0.11**            0.02
  LAP             4.19    3.08   25.83    5.15        −0.04              0.03
  Overall        12.43   11.46   29.17    6.72        −0.40**           −0.08*
  Other           7.50    6.49   22.50    5.15        −0.04             −0.18**

Note. M = mean; Mdn = median; SD = standard deviation; CQ = communicative quality; ORG = organization; ARG = argumentation; LAC = linguistic accuracy; LAP = linguistic appropriacy; Overall = overall impression; Other = other aspects of writing. * p < 0.05; ** p < 0.01.