The Subjective and Objective Interface of Bias Detection on Language Tests
Steven J. Ross and Junko Okabe
Kwansei Gakuin University
Kobe-Sanda, Japan
Test validity is predicated on there being a lack of bias in tasks, items, or test content. It is well known that factors such as test candidates' mother tongue, life experiences, and socialization practices of the wider community may serve to inject subtle interactions between individuals' background and the test content. When the gender of the test candidate interacts further with these factors, the potential for item bias to influence test performances grows. A dilemma faced by test designers concerns how they can proactively screen test content for possible sources of bias. Conventional practices in many contexts rely on the subjective opinion of review panels in detecting sensitive topical content and potentially biased material and items. In the last 2 decades this practice has been rivaled by the increased availability of item bias diagnostic software. Few studies have compared the relative accuracy and cost utility of the two approaches in the domain of language assessment. This study makes just that comparison. A 4-passage, 20-item reading comprehension test was given to a stratified sample of 825 high school students and college undergraduates at 5 Japanese institutions. The sampling included a focus group of 468 female students compared to a reference group of 357 male English as a foreign language (EFL) learners. The test passages and items were also given to a panel of 97 in-service and preservice EFL teachers for subjective ratings of potential gender bias. The results of the actual item responses were then empirically checked for evidence of differential item functioning using Simultaneous Item Bias analysis, the Mantel-Haenszel Delta method, and logistic regression. Concordance analyses of the subjective and objective methods suggest that subjective screening of bias overestimates the extent of actual item bias. Implications for cost-effective approaches to item bias detection are discussed.

Correspondence should be addressed to Steven J. Ross, School of Policy Studies, Kwansei Gakuin University, Gakuen 2–1, Sanda, Hyogo, Japan 669-1337. E-mail: sross@ksc.kwansei.ac.jp

The issue of test bias has always been central in the consideration of test validity. Bias has been of concern because inferences about the results of test outcomes often lead to consequences affecting the life-course trajectories of test candidates, such as in the use of tests for employment, admissions, or professional certification. Test results may be considered unambiguously fair to the extent candidates are compared, as in the case of norm-referenced tests, on only the domain-relevant constructs included in the measurement instrument devised for the purpose. In the real world of testing practice, uncontaminated construct-relevant domain coverage is often more an ideal than a reality. This is especially true when the testing construct involves domains of knowledge or ability related to language learning.
ISSUES IN SECOND LANGUAGE ASSESSMENT BIAS
Language learning, particularly second or foreign language learning, is influenced to no small degree by factors that interact with, and that are sometimes even independent of, the direct consequences of formal classroom-based achievement. Yet in many high stakes contexts, foreign or second language ability is used as a gate-keeping criterion for employment and admissions decisions. Further, inclusion of foreign language ability on selection tests is often predicated on the assumption that candidates' relative standing reflects the cumulative effects of achievement propelled by long-term commitment to diligent scholarship. These assumptions do not often factor in the possibly biasing influences of cross-linguistic transfer and naturalistic acquisition on individual differences in test outcomes. Constructing high stakes measures to be free of these kinds of bias presents a challenging task to language test designers, particularly when the implicit meritocratic intention is to reward scholastic achievement.
Studies of bias on language tests have tended to fall into the three broad categories of transfer, experience, and socialization practices. The first, which accounts for the influence of transfer from a first-learned language to a second or foreign language, addresses the extent of bias occurring when speakers of different first languages are tested on a common second language. Chen and Henning (1985), for instance, noted the transferability of Latin cognates from Spanish to English lexical recognition, which served to bias native speakers of Spanish over native speakers of Chinese. Working in the same vein, Sasaki (1991) corroborated Chen and Henning using a different differential item functioning (DIF) detection method. Both of these studies suggested that when novel words are encountered by Spanish and Chinese speakers, the cognitive task of lexical inference differs. For instance, consider the following sample sentence:
Residents evacuated their homes during the conflagration.
For Romance language speakers, the deductive task is to parse "conflagration" for its affixation and locate the core free morpheme. Once located, the Romance language speaker can compare the root to similar known free morphemes in the reader's native language, for instance, incendio or conflagración.

The Chinese speaker, in contrast, starts at the same deductive step, but must compare the free root morpheme to all other previously learned morphemes (i.e., most probably, "flag"). The resulting difference leads Spanish speakers to follow a semantically based second step, while Chinese speakers are likely to split between a semantic and a phonetic comparison strategy. The item response accuracy in such cases favors the Romance language speakers, even when matched with Chinese counterparts for overall proficiency.
The transferability factor applies to orthographic phenomena as well. Brown and Iwashita (1996) detected bias favoring Chinese learners of Japanese over native English speakers, whose native language orthography is typologically most distant from Japanese. Given the fact that modern written Japanese relies on Chinese character compounds for the formation of nominal phrases, as well as the root forms of many verbs, Chinese students of Japanese can transfer their knowledge of semantic roots for many Japanese words and compounds, even without knowledge of their corresponding phonemic representations or exact semantic reference.

Here a similar strategic difference emerges for speakers of Chinese versus speakers of an Indo-European language. While the exact compound might not exist in modern written Chinese, the component Chinese characters provide a deductive strategy to Chinese learners of Japanese that is not available to English speakers. Consider, for instance, a sentence containing the compound 新幹線 (bullet train), which does not have a direct counterpart in Chinese. The component characters 新 "new," 幹 "trunk," and 線 "line" provide the basis for a lexical inference that the compound refers to a kind of rail transportation system. For an English-speaking learner of Japanese, the cognitive load falls on deducing the meaning of the whole compound from its components. Here, a mixed grapheme-to-phoneme strategy is most likely if 新 "new" and 線 "line" are recognized as "shin" and "sen." The lexical inference here might entail filling in the missing component 幹 "trunk" with a syllable that matches the surrounding "shin _ sen" for successful compound word recognition.
Examining transferability on a macrolevel, Ross (2000), while controlling for biographical and experiential factors such as age, educational background, and hours of ESL learning, found weaker evidence of a language distance factor. The distance factor comprised canonical syntactic structure, orthography, and typological grouping, which served to influence the relative rates of learning English by 72 different groups of migrants to Australia.
The overall picture of transfer bias suggests that on the microlevel, particularly in studies that triangulate two different native languages against a target language, evidence of transfer bias tends to be identifiable. When many languages are compared and individual differences in experiential and cognitive variables are factored in, transfer bias at the macro or language-typological level appears to be less readily identifiable.
com-A second type of bias in language assessment arises from differential exposure to
a target language that candidates might experience Ryan and Bachman (1992), forinstance, considered Test of English as a foreign language (TOEFL) type items to bemore culturally oriented toward the North American context than a British compari-son, the First Certificate in English Language learners with exposure to instruction
in American English and test TOEFL preparation courses were thought to have agreater chance on such items than learners whose exposure did not prepare them forthe cultural framework TOEFL samples in its reading and listening items Theirfindings suggest that high stakes language tests for admissions such as TOEFL mayindirectly include knowledge of cultural reference in addition to the core linguisticconstructs considered to be the object of measurement Presumably this phenome-non would be observable on language tests such as the International English Lan-guage Testing System (IELTS), which is designed to qualify candidates for admis-sions to universities in the United Kingdom, New Zealand, or Australia
Cultural background comparisons in second language performance assessments have demonstrated how speech community norms may transfer into assessment processes like oral proficiency interviews. While not overtly recognized as a source of assessment bias, interlanguage pragmatic transfer has been seen to influence the performances of Asian speakers when compared to European speakers (Young, 1995; Young & Halleck, 1998; Young & Milanovic, 1992). The implication is that if assessments are norm-referenced, speakers from discourse communities favoring verbosity may be advantaged in assessments such as interactive interviews. This observation apparently extends to semi-direct speech tasks such as the SPEAK test. Kim (2001), for instance, found differential rating functions for pronunciation and grammar ratings for Asians when compared to equal-ability European test candidates. The implication here is that raters apply the rating scale differently.
In considering possible sources of bias in university admissions, Zwick and Sklar (2003) opined that the foreign language component on the SAT II created a "bilingual advantage" for particular candidates for admission to the University of California. If candidates had been raised in bilingual households, for instance, they would be expected to score higher on the foreign language listening comprehension component, which is an optional third subscore on the SAT II. This test is required for undergraduate admissions to all campuses of the University of California. The issue of bias in this case stems from the assumption that the foreign language component was presumably conceptualized as an achievement indicator, when in fact the highest scoring candidates are from bilingual households. The perceived advantage is that such candidates develop their proficiency not through coursework and scholarship, but through naturalistic exposure.
Elder (1997) reported on a similar fairness issue arising from the use of second language tests for access to higher education in Australia. Elder noted that the score weighting policy on the Victoria Certificate of Education, functioning as it does as a qualification for university admission in that state, explicitly profiles the language learning history of the test candidate. This form of candidate profiling aimed to reweight the influence of the foreign language scores on the admissions qualification so as to minimize the preferential bias bilingual candidates enjoyed over conventional foreign language learners. Elder found that interactions between English and the profile categorizations were not symmetric across different foreign language test candidatures and concluded that efforts to adjust for differential exposure profiles are fraught with difficulty.
A third category of bias in language assessment deals with differences in socialization patterns. Socialization patterns might involve academic tracking early in a school student's educational career, usually into either science or humanities academic tracks in high school (Pae, 2004). In some cultural contexts, academic tracking might correspond to gender socialization practices as well.

In contrast to cultural assumptions made about the verbal advantage females have over males, Hyde and Linn (1988) concluded in a meta-analysis of 165 studies of gender differences on all facets of verbal tests that there was an effect size of d = .11 for gender differences. To them, this constituted little firm evidence to support the assumed female verbal advantage. Willingham and Cole (1997) and Zwick (2002) concur with this interpretation, noting that gender differences have steadily diminished over the last four decades and now account for no more than 1% of the total variation on ability tests in general. Willingham and Cole (1997, p. 348), however, noted that females tend to frequent the top 10% in standardized tests of reading and writing.
Surveys of gender differences on the Advanced Placement Test, used for university admissions to the more selective American universities, suggest reasons why verbal differences in literacy still tend to persist. Dwyer and Johnson (1997, p. 136) describe considerable effect size differences between college-bound males and females in preference for language studies. This finding would suggest that in the North American context socialization patterns could serve to channel high school students into academic tracks that tend to correlate with gender.

To date, language socialization issues have not been central in foreign or second language test bias analyses in multicultural contexts because of the more immediate and salient influences of exposure and transfer on high stakes tests. In contexts that are not characterized by multiculturalism, a more subtle threat of bias may be related to how socialization practices steer males and females into different academic domains, and in doing so cumulatively serve to make gender differentially salient in particular knowledge domains. When language tests inadvertently sample particular domains more than others, the issue of schematic knowledge interacting with the gender of the test candidate takes on a new level of importance.
In a study of differential item functioning on a foreign language vocabulary test for Finnish secondary students, Takala and Kaftandjieva (2000) found that individual vocabulary items showed domain-sampling effects, whereas the total score on the test did not reflect systematic gender bias. Their study identified how words sampled from male activity domains such as mechanics and sports might yield higher scores for male test candidates than for females at the same ability level. Their approach used conventional statistical analyses of DIF, which, according to some current standards of test practices, would serve to identify and eliminate biased items before test scores are interpreted (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). With such practices for bias-free testing, faulty items would be screened through sensitivity review and content moderation prior to test administration, and then subjected to DIF analyses before the final score tally.
The issue of interest we address in this article is how gender bias on foreign language tests devised for high stakes purposes can be diagnosed when accepted cultural practices disfavor the use of empirical analysis of item functioning prior to score interpretation. In this study we address the issue of the accuracy of sensitivity review and bias screening through content moderation prior to test administration by comparing the judgments of both expert and novice moderation groups with the results of three different empirical approaches to DIF.
BACKGROUND TO THE STUDY
Four sample subtests written for a high stakes university admissions test were used in the study. The subtests were all from the fourth section of a six-section English as a foreign language (EFL) test given annually to approximately 630,000 Japanese high school seniors. The results of the exam are norm-referenced and serve to qualify candidates for secondary examinations to specific academic departments at national and public universities (Ingulsrud, 1994). Increasingly, private Japanese universities use the results of the Center examination for admissions decisions, making the test the most influential gate-keeping device in the Japanese educational system.
The format of the EFL test is a "discrete point" type of test of language structure and vocabulary, sampling the high school syllabus mandated by the Japanese Ministry of Education. It is construed as an achievement test because only vocabulary and grammatical structures occurring in about 40 high school textbooks sanctioned by the Ministry of Education are sampled on the test. The six sections of the examination cover knowledge of segmental pronunciation, tonic word stress, discrete-point grammar, word order, paragraph coherence and cohesion, interpretation of short texts describing graphics and data in tabular format, interactive dialogic discourse in the form of a transcribed conversation, and comprehension of a 400-word reading comprehension passage. All items, usually 50 in all, are in multiple-choice format to facilitate machine scoring.
The test is constructed by a committee of 20 examiners who convene 40 days each year to draft, moderate, and revise the examination before its administration in January each year. On several occasions during the test construction period the draft passages and items are sent out to an external moderation panel for sensitivity and bias review. The external moderation panel, whose membership is not known to the test committee members, is composed of former committee members and examination committee chairpersons. Their task is to critique the draft passages and items and to recommend changes, large and small. On occasion the moderation panel recommends substitution of entire draft test sections. This usually occurs when issues of test sensitivity or bias are raised. The criteria for sensitivity are themselves highly subjective and variable across moderation panels. For some, test content should involve "heart-warming" topics that avoid dark or pessimistic themes. For others, avoiding references to specific social or ethnic groups may be the most important criterion.
The four passages included in the study were originally drafted for the fourth section of the EFL language examination. The specifications for the fourth section call for three or four paragraphs describing charts, figures, or tabular data concerning hypothetical experimental or survey data in a social science domain. This section of the test is known to be the most domain-sensitive, because the content sampling usually sits at the borderline where male–female differences in experiential schemata begin to emerge in the population.

The four passages were never used in the operational test, but were held in reserve as alternates. All four had at various stages of development undergone external review by the moderation panel and were found to be possibly too gender sensitive, thus ending further investment in committee time for their revision.

The operational test is not screened with DIF statistics prior to score interpretation. The current test policy endorsed by the Japanese testing community is predicated on the assumption that the moderation panel reviews are sufficiently accurate in detecting faulty, insensitive, or biased items before any are used on the operational test. The research issue addressed here thus considers empirical evidence of the accuracy of the subjective approach currently used, and directly examines evidence that subjective interpretation of gender bias in fact concurs with objective analyses using empirical methods common to DIF analysis.

METHOD
The four-passage, 20-item reading comprehension test was given to a stratified sample of 825 high school students and college undergraduates at five institutions. The sampling included a focus group of 468 female students compared to a reference group of 357 male EFL learners. The aim of the sampling was to approximate the range of scores normally observed in the population of Japanese high school seniors. The 20-item test was given in multiple-choice format with enough time (1 hr) for completion, and was followed with a survey about the age, gender, and language learning experiences of the sample test candidates.
Materials
The test section specifications call for a three- to four-paragraph text describing graphs, figures, or tables written as specimens of social science academic writing. In the case of the experimental test, four of these passages were used. Each of the passages had five items that tested readers' comprehension of the passage content. The themes sampled on the test can be seen in Table 1.

The experimental test comprised four short reading passages, which closely approximate the format and content of Section Four of the Center Examination. The sampling of students in this study yielded a mean and variance similar to the operational test. Table 2 lists descriptive statistics for the test.
Bias Survey Procedure
A test bias survey was constructed for use by in-service and preservice EFL teachers. The sampling of high school level teachers parallels the normal career path of Japanese members of a typical moderation panel. The actual external moderation panel is composed of university faculty members, most of whom had followed a career path starting with junior and senior high school EFL teaching.
TABLE 1
Experimental Passage Order and Thematic Content

I. Letter rotation experiment
II. Visual illusions experiment
III. Soccer league tournament
IV. Survey of transportation use changes

TABLE 2
Mean, Standard Deviation, Internal Consistency, and Sample Size
The bias survey was thus devised to sample early-, mid-, and late-career EFL teachers who were assumed to represent the larger population of language teaching professionals from whom future test moderation panel members are drafted. In-service teachers (n = 37) were surveyed individually.
In addition to the sampling of in-service teachers, a larger group of preservice EFL teachers in training were also surveyed so as to compare the ratings provided by seasoned professional teachers with those of neophyte teachers (n = 60). All respondents were asked to examine the four test passages and each of the 20 items on the test before rating the likelihood that each item would favor male or female test candidates. The preservice teachers in training completed the survey during Teaching English as a Foreign Language (TEFL) Methodology course meetings.

The rating scale and instructions used are shown in the Appendix.
ANALYSES: OBJECTIVE DIFFERENTIAL ITEM FUNCTIONING ANALYSIS
A variety of options now exist for detecting DIF. Comparative research suggests that DIF methods tend to differ in the extent of Type I error and power. Whitmore and Schumacker (1999), for instance, found logistic regression more accurate than an analysis of variance approach. A direct comparison of logistic regression and the Mantel-Haenszel procedure (Rogers & Swaminathan, 1993) indicated moderate differences in power. Swanson, Clauser, Case, Nungester, and Featherman (2002) more recently approached DIF with hierarchical logistic regression and found it to be more accurate than standard logistic regression or Mantel-Haenszel estimates. In this approach, different possible sources of item bias can be dummy-coded and nested in the multilevel design. Recent uses of logistic regression for DIF extend to polytomous rating categories (Lee, Breland, & Muraki, 2005) but still enable an examination of nonuniform DIF through interaction terms between matching scores and group membership.

Although multilevel modeling approaches offer extended opportunities for testing nested sources of potential DIF, the single-level methods, such as logistic regression and Mantel-Haenszel approaches, have tended to prevail in DIF studies. Penfield (2001) compared three variants of Mantel-Haenszel according to differences in the criterion significance level, and concluded that the generalized approach provided the lowest error and most power. Zwick and Thayer (2002) found that modifications of the Mantel-Haenszel procedure involving an empirical Bayes approach showed promise of greater potential for bias detection. A direct comparison of the Mantel-Haenszel procedure with Simultaneous Item Bias (SIB; Narayanan & Swaminathan, 1994) concluded that the Mantel-Haenszel procedure yielded smaller Type I error rates relative to SIB.
In this study, three empirical methods of detecting DIF were used. The choice of bias detection methods was based on their overall frequency of use in empirical DIF studies. The three methods were thought to represent conventional approaches to DIF research, and thus best operationalize "objective" approaches to be compared with subjective methods.
em-Mantel-Haenszel Delta was computed from six sets of equipercentile-matchedability subgroups cross tabulated by gender Differences in the observed Deltas forthe matched males and females were evaluated against a chi-square distribution.This method matches males and females along the latent ability continuum and de-tects improbable discontinuities between the expected percentage of success andthe observed data
The second method of detecting bias was a logistic regression performed on the dichotomously scored outcomes for each of the 20 items. The baseline model tested the effects of gender controlling for each student's total score (Camilli & Shepard, 1994). In this binary regression, the probability of success should be solely influenced by the individual's overall ability. In the event of no bias, only the test score will account for systematic covariance with the responses to a particular item. If bias does affect a particular item, the variable encoding gender will covary with the item response independently of the covariance between the score and the outcome. Further, if bias is associated with particular levels of ability on the latent score continuum, a nonuniform DIF can be diagnosed with a Gender × Total Score interaction term:
Item response = constant + gender + score + (gender × score)
In the event a nonuniform DIF is confirmed not to exist, the interaction term can be deleted to yield a main effect for gender, controlling for test score. Gender effects are then tested for nonrandomness against a t distribution.
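The two-step model comparison described above can be sketched as follows, assuming a data frame with one row per candidate and illustrative column names (item for the 0/1 response, gender, and score for the total); this is a sketch of the general procedure rather than the study's actual code.

```python
# Logistic regression DIF sketch: interaction first, then the gender main effect.
import statsmodels.formula.api as smf

def logistic_dif(df, alpha=0.05):
    """Classify one item as showing nonuniform, uniform, or no DIF."""
    # Full model: the Gender x Total Score term tests nonuniform DIF.
    full = smf.logit("item ~ gender + score + gender:score", data=df).fit(disp=0)
    if full.pvalues["gender:score"] < alpha:
        return "nonuniform DIF", full
    # Interaction deleted: the gender main effect, controlling for total
    # score, tests uniform DIF.
    reduced = smf.logit("item ~ gender + score", data=df).fit(disp=0)
    label = "uniform DIF" if reduced.pvalues["gender"] < alpha else "no DIF"
    return label, reduced
```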
The third empirical method was a simultaneous item bias procedure utilizing item response theory (Shealy & Stout, 1993). The SIB approach was performed on each of the 20 items in turn. The sums of all other items were used in rotation as ability estimates in matching male and female examinees via a regression approach. This approach employs the matching strategy of the Mantel-Haenszel method, and uses the total score based on k − 1 items as a concurrent covariate for each of the item bias tests. Differences in estimates of DIF were evaluated against a z distribution.
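In spirit, the matching runs as in the sketch below: candidates are stratified on the rest score formed from the other k − 1 items, and the weighted difference in proportion correct between the groups is referred to a z distribution. This simplified version omits the regression-based correction for measurement error in the matching score that the full SIBTEST procedure applies; all names are illustrative.

```python
# Simplified rest-score matching in the SIB spirit (no regression correction).
import numpy as np
from scipy.stats import norm

def sib_beta(responses, gender, item):
    """responses: (candidates x items) 0/1 matrix; gender: 0 ref / 1 focal."""
    rest = responses.sum(axis=1) - responses[:, item]   # k-1 rest score
    y = responses[:, item]
    terms, total = [], 0
    for s in np.unique(rest):
        ref = y[(rest == s) & (gender == 0)]
        foc = y[(rest == s) & (gender == 1)]
        if len(ref) < 2 or len(foc) < 2:
            continue                  # stratum needs both groups represented
        w = len(ref) + len(foc)       # weight strata by size
        terms.append((w, ref.mean() - foc.mean(),
                      ref.var(ddof=1) / len(ref) + foc.var(ddof=1) / len(foc)))
        total += w
    beta = sum(w * d for w, d, _ in terms) / total      # weighted mean difference
    var = sum((w / total) ** 2 * v for w, _, v in terms)
    z = beta / np.sqrt(var)
    return beta, 2 * norm.sf(abs(z))                    # two-sided p value
```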
The composite results of the three different approaches to estimating DIF for each of the 20 items are given in Table 3. Each of the three objective measures employs a different test statistic to assess the likelihood of the observed bias statistic. Analogous to meta-analytic methods, the different effects can be assessed as standardized metrics. To this end, each DIF estimate, controlled for overall candidate ability, is presented as a conventional probability (p < .05) of rejecting the null hypothesis. Table 3 indicates that the Mantel-Haenszel and SIB approaches are equally parsimonious in detecting gender bias on the 20-item test. Both of these methods employ ability matches of men and women along the latent ability continuum. In contrast, the logistic regression approach, which uses the total score as a covariate, appears slightly more likely to detect bias. All three methods concur in detecting gender bias on the Soccer item 13 shown in Table 4.
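Assembling a per-item summary in the manner of Table 3 then amounts to running the three screens over the full response matrix. The sketch below reuses the illustrative functions from the preceding sketches and flags any item significant at the conventional p < .05 level on any of the three methods.

```python
# Composite per-item DIF summary, reusing the sketch functions defined above.
import numpy as np
import pandas as pd

def dif_summary(responses, gender):
    total = responses.sum(axis=1)
    # Six equipercentile ability bands, as in the Mantel-Haenszel analysis.
    cuts = np.percentile(total, [100 / 6 * k for k in range(1, 6)])
    bands = np.digitize(total, cuts)
    rows = []
    for i in range(responses.shape[1]):
        df = pd.DataFrame({"item": responses[:, i], "gender": gender,
                           "score": total})
        _, _, p_mh = mantel_haenszel_delta(responses[:, i], gender, bands)
        _, model = logistic_dif(df)
        _, p_sib = sib_beta(responses, gender, i)
        rows.append({"item": i + 1, "MH": p_mh,
                     "LR": model.pvalues["gender"], "SIB": p_sib})
    out = pd.DataFrame(rows)
    out["flagged"] = (out[["MH", "LR", "SIB"]] < 0.05).any(axis=1)
    return out
```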
ANALYSIS: SUBJECTIVE ESTIMATES OF BIAS
The panel of 97 preservice and in-service teachers was categorized into male and female subgroups of novices and experienced teachers based on survey responses.

TABLE 3
Objective Bias Probabilities Per Item

TABLE 4
Soccer Item 13: "If the Fighters defeat the Sharks by a score of 1–0, then:"

The aim of this subdivision of the subjective raters was to explore possible sources of differential sensitivity to bias in the test questions. In contrast to the objective methods of diagnosing item bias, the subjective ratings do not employ any information about individual ability inferred from the total score. Subjective estimates rely completely on the apparent schematic content and presumed world knowledge needed to answer each item. Further, because ratings were on a Likert-type scale, differences between the observed mean rating and the null hypothesis needed to be tested to provide bias probabilities¹ comparable to those in Table 3. To this end, the mean rating of gender bias on each of the 20 items was tested against the hypothesis that male versus female advantage on each item equaled zero. Table 5 contains the subjectively estimated probabilities that each item is biased.
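The rating test just described reduces to a one-sample t test per item. A minimal sketch, assuming the panel's ratings are coded with 0 as the no-advantage midpoint and stored as a raters × items array (the coding direction and names are assumptions):

```python
# One-sample t tests of mean panel ratings against mu = 0, one per item.
import numpy as np
from scipy.stats import ttest_1samp

def subjective_bias_p(ratings):
    """ratings: (n_raters x 20) array of signed gender-bias ratings."""
    return np.array([ttest_1samp(ratings[:, i], popmean=0).pvalue
                     for i in range(ratings.shape[1])])
```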
In contrast with the objective measures of bias, the subjective analysis diagnoses considerably more bias. Complete subjective agreement in diagnosing bias occurs in 6 of the 20 items. As the third column in Table 5 suggests, it appears that experienced male (EM) teachers are the most inclined to assume there is gender bias. This subgroup in fact sees bias in the majority of items. Experienced female teachers, in contrast, are the most conservative in assuming that schematic content indicates possible test item bias. The novice male teachers-in-training correspond to their more experienced male counterparts in assuming there is a schematic bias in two of the four test passages.

Of particular interest is the tendency of the subjective raters to apply the bias diagnosis not to individual items, but to entire test passages. It appears likely that these male respondents equate content sensitivity with test bias. Sources of this confusion will be examined in the narrative accounts provided by some of the male teachers. The tendency to see content schema as bias suggests that subjective raters see topical domains as the key source of possible bias. Both male and female in-service teachers would be expected to share equivalently accurate knowledge about the cumulative consequences of socialization on Japanese teenagers' world knowledge. As Table 5 would suggest, however, the experienced male teachers appear to overgeneralize the extent of possible schematic knowledge differences between male and female students. The domains that appear to be high bias risks to male teachers involve spatial processing (Visuals 6–9), all of the items concerned with a sports tournament (Soccer 11–15), and all of the items about the passage describing changes in transportation (Transport 16–20).
Subjective Accounts of Bias
As a way of accounting for the presumed bias in the test items, interviews with three veteran male instructors were undertaken. These interviews provide a post hoc reflective account of why bias would be expected in test items. The three accounts provide subjective evidence as to the sources of the putative sensitivity or bias in test items. Three facets of belief about gender differences were included in the interview. The first was a global impression of the test materials. The second was concerned with the four passages used in the actual reading test. The third question in the interview phase addressed how each male teacher assumes his colleagues are aware of gender differences among students.

¹Subjective ratings were tested against a null hypothesis by assuming the population mean bias (mu) equals zero and testing the observed subjective mean against a single-sample t distribution. Exact probabilities of the observed t tests were then used in Table 5.
Teacher A (mid-30s, male)
Overall impression and belief
"In general, I believe that boys are better than girls at understanding scientific and logical essays. Having said so, I normally don't pay attention to such gender differences. The actual English test materials, when compared with Japanese materials, the topics are easier and the contents are less complicated. Thus I tend to focus students on how to solve the tasks and get higher scores rather than on the comprehension of the contents. In this sense, rather than the topic, the form of the tasks may cause a different per-

TABLE 5
Subjective Bias Probabilities Per Item