BRIEF REPORTS AND SUMMARIES
TESOL Quarterly invites readers to submit short reports and updates on their work. These summaries may address any areas of interest to Quarterly readers.
Edited by JOHN FLOWERDEW
University of Leeds
ALI SHEHADEH
United Arab Emirates University
Differential Item Functioning on an English
Listening Test Across Gender
GI-PYO PARK
Soonchunhyang University
Seoul, South Korea
Differential item functioning (DIF) is present when two groups of equal ability show a differential probability of a correct response to an item (Ellis & Raju, 2003). Two groups refers to the focal group, or the group of primary interest, and the reference group, or a standard group against which the focal group is compared. Equal ability in group comparison is important because it tells whether group differences in performance arise from true differences in ability or from item bias (Elder, 1997). Differential probability concerns item difficulty, or uniform DIF, and item discrimination, or nonuniform DIF. Studies on DIF are crucial because DIF deals with the fairness of items across groups, say gender and socioeconomic status, beyond group-mean differences (Thissen, Steinberg, & Gerrard, 1986).
Several studies on DIF have investigated whether gender differences in test performance resulted from gender bias (Drasgow, 1987; Ryan & Bachman, 1992; Takala & Kaftandjieva, 2000). Using item response theory, Drasgow investigated whether the ACT Mathematics Usage Test, which consisted of 40 items, functioned differentially across gender and race. He found that five items showed evidence of DIF. However, test characteristic curves, which are the sum of the item characteristic curves, identified no group differences in the cumulative effects of DIF in the test as a whole.
Ryan and Bachman (1992) detected DIF in the Test of English as a Foreign Language (TOEFL) and in the First Certificate of English (FCE) using the Mantel-Haenszel procedure across gender and language background (Indo-European/non-Indo-European). In terms of gender, four of the 140 TOEFL items favored males, and two TOEFL items favored females; in the FCE, one of the 38 items favored males, and one item favored females. For language background, 32 TOEFL items were easier for Indo-European native speakers, and 33 TOEFL items were easier for non-Indo-European native speakers; in the FCE, 13 items were easier for Indo-European native speakers, and 12 were easier for non-Indo-European native speakers.
Takala and Kaftandjieva (2000) examined whether the vocabulary subtest of the Finnish Foreign Language Certificate Examination (FFLCE) showed evidence of DIF. The FFLCE had 11 items showing DIF, with six items favoring males and five items favoring females. Despite these findings, however, excluding the DIF items from the test did not affect the ability parameter estimations for males and females, probably because the DIF items canceled each other out.
Previous studies on DIF across gender have identified DIF using various methods such as the Mantel-Haenszel procedure, item response theory, and confirmatory factor analysis. However, few have further analyzed the sources of DIF in terms of variables such as language type (dialogue and monologue), question type (local, global, and expression), content about text, and picture presence in the item (Engelhard, Hansche, & Rutledge, 1990; Pae, 2002; Roznowski, 1987).
The purpose of this study was to identify DIF across gender using the Mantel-Haenszel procedure in the English listening part of the 2003 Korea College Scholastic Ability Test (KCSAT). Another purpose of this study was to further articulate the sources of DIF in relation to four important variables considered when the English listening test was developed: language type, question type, content about text, and picture presence. Two research questions guide this study: (a) Does the English listening test in the 2003 KCSAT include items displaying DIF across gender? (b) If so, what are the sources of DIF? The answers to these questions will sensitize item developers to the issue of DIF, which will, in turn, help them to develop items free from bias based on gender, socioeconomic status, and other factors.
METHOD
Participants
The participants were 20,000 males (half in liberal arts and half in science) and 20,000 females (half in liberal arts and half in science) chosen from the 675,922 examinees who took the 2003 KCSAT. With the cooperation of the Korea Institute of Curriculum and Evaluation (KICE), the participants were chosen based solely on gender and academic background (liberal arts versus sciences) to minimize confounding variables.
The Mantel-Haenszel Procedure
The Mantel-Haenszel procedure was chosen to detect DIF because the procedure is easy to use and because it is widely accepted as a measure of DIF. The procedure was introduced by Mantel and Haenszel (1959), and adapted by Holland and Thayer (1988), to identify items displaying DIF across members of different subgroups. Using the Mantel-Haenszel χ² statistic with one degree of freedom, the procedure tests the null hypothesis H₀: α = 1, where α is the common odds ratio.
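To make the computation concrete, the following minimal sketch estimates the common odds ratio and the Mantel-Haenszel χ² for a single item by stratifying examinees on their total score. It is an illustration only: Python and NumPy, and the function and variable names, are assumptions of this sketch, not part of the KCSAT analysis.

```python
import numpy as np

def mantel_haenszel_dif(item_correct, is_focal, total_score):
    """Estimate the MH common odds ratio (alpha) and the MH chi-square
    statistic (1 df, with continuity correction) for one item."""
    num = den = 0.0              # numerator/denominator of the odds ratio
    obs_a = exp_a = var_a = 0.0  # observed count, expectation, and variance
    for k in np.unique(total_score):
        stratum = total_score == k
        # 2 x 2 table at score level k (reference group in the top row)
        a = np.sum(stratum & ~is_focal & (item_correct == 1))
        b = np.sum(stratum & ~is_focal & (item_correct == 0))
        c = np.sum(stratum & is_focal & (item_correct == 1))
        d = np.sum(stratum & is_focal & (item_correct == 0))
        n = a + b + c + d
        if n < 2:
            continue  # a one-person stratum carries no comparative information
        num += a * d / n
        den += b * c / n
        obs_a += a
        exp_a += (a + b) * (a + c) / n
        var_a += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    alpha = num / den                               # H0: alpha = 1 (no DIF)
    chi2 = (abs(obs_a - exp_a) - 0.5) ** 2 / var_a  # compare to chi-square(1)
    return alpha, chi2
```

Under this setup, α greater than 1 means the item was relatively easier for the reference group (here, males), and α less than 1 means it was relatively easier for the focal group (females), after matching on total score.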
Instrument
The English listening test from the 2003 KCSAT was used as the instrument because the KCSAT is a high-stakes test that fits the study of DIF; the KCSAT plays a critical role in deciding admission to college in Korea. The English listening test was developed in about a month by a special testing committee appointed by the KICE. The committee consisted of English professors and teachers who had expertise in developing test items. KICE asked them to develop various items, specifically considering language type (dialogue and monologue), question type (local, global, and expression), picture presence, and content (Nunan, 1991; Shohamy & Inbar, 1991). The draft of the test was reviewed twice by two different review committees consisting of high school teachers. The review committees were asked to estimate item difficulty and to screen out any items that were similar to items already used elsewhere. Pretesting to investigate the psychometric properties of the test was not possible for security reasons.
The English listening test was in multiple-choice format, consisting of 17 items varying in language type, question type, picture presence, and content. In terms of language type, the test included 14 items based on dialogues and three items based on monologues. For question type, it included eight global questions asking for inferential information, four local questions asking for factual information, and five questions asking about appropriate expression. With regard to picture presence, the test consisted of two picture items and 15 nonpicture items. In addition, the English listening test covered a variety of content such as exchanging information on sports, a health club, and a city; discussing a customer complaint and problems in class; describing views; visiting a patient; asking a person to record a TV program; asking citizens to help in a festival; advising students to behave properly; planning for the weekend; going out for dinner; and identifying a person in a picture.
Each text about dialogues and monologues in the test comprised about 85 words, with 9–11 turns between speakers in dialogues. The text was recorded by two native speakers of English, one male and one female, at a speed of about 140 words per minute. The reliability of the test, as measured by Cronbach's α, was 0.802.
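As a point of reference, Cronbach's α can be computed directly from the scored response matrix. The sketch below assumes a NumPy array of 0/1 scores with rows for examinees and columns for the 17 items; the names are illustrative, not from the study.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of item scores (examinees x items)."""
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)
```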
FINDINGS
Before identifying DIF across gender with the Mantel-Haenszel procedure in the English listening test, the dimensionality and item difficulty of the test were investigated. Dimensionality, or the number of latent dimensions, was investigated by principal component analysis (see Table 1). The listening test was either unidimensional (the test measured only one ability), determined by the variance of the first factor, 24.60%, or multidimensional (the test measured more than one ability), determined by the eigenvalue of the first factor, 4.18 (Reckase, 1979).¹ This result indicated that the Mantel-Haenszel procedure, rather than item response theory, which assumes unidimensionality, should be used to investigate whether the test functioned differently across gender with these data (Hambleton, Swaminathan, & Rogers, 1991).
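A minimal sketch of this dimensionality check follows, assuming the same 0/1 response matrix as above; it extracts the eigenvalues of the inter-item correlation matrix. With 17 items the trace of that matrix is 17, so a first eigenvalue of 4.18 corresponds to the reported 24.60% of total variance (4.18 / 17 ≈ 0.246).

```python
import numpy as np

def first_component_strength(scores):
    """Largest eigenvalue of the inter-item correlation matrix and the
    percentage of total variance it explains."""
    corr = np.corrcoef(scores, rowvar=False)  # treat items as variables
    eigenvalues = np.linalg.eigvalsh(corr)    # returned in ascending order
    first = eigenvalues[-1]
    pct = 100.0 * first / eigenvalues.sum()   # trace equals the item count
    return first, pct

# Reckase's (1979) criterion: the first factor dominates parameter
# estimation if its eigenvalue is at least 10 or it accounts for at
# least 20% of the total variance.
```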
To provide a rationale for studying DIF, item difficulty statistics were calculated before matching ability levels, followed by t tests for males and females. Only two items (2 and 12) of the 17 items in the listening test were significantly easier for males, whereas 13 items (3–9, 11, 13, 14–17) were significantly easier for females. This finding indicates that the female participants had better foreign language listening ability than the male participants (Ryan & Bachman, 1992; 2002 TOEIC Through Data, 2002).
¹ Reckase (1979) argued that for the first factor to control the estimation of the parameters, it should have an eigenvalue of 10 or greater or account for at least 20% of the total variance.
TABLE 1 Results of Principal Component Analysis
However, because these statistics were calculated before ability levels were matched between the two groups, the results could not tell whether the group differences in item difficulty resulted from true group differences in ability or item bias (Elder, 1997). Thus, after ability levels were matched, differential item functioning for the two groups in the test was investigated in depth.
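The pre-matching comparison can be reproduced with classical difficulty indices (proportion correct per group) and an independent-samples t test per item. The sketch below assumes the same response matrix and a boolean gender indicator, and uses SciPy for the t test; none of these names come from the study.

```python
import numpy as np
from scipy import stats  # SciPy is an assumed dependency of this sketch

def difficulty_and_t_tests(scores, is_female):
    """Per-item proportion correct for each group, with a t test of the
    group difference computed before any ability matching."""
    p_male = scores[~is_female].mean(axis=0)
    p_female = scores[is_female].mean(axis=0)
    results = []
    for j in range(scores.shape[1]):
        t, p = stats.ttest_ind(scores[is_female, j], scores[~is_female, j])
        results.append((j + 1, p_male[j], p_female[j], t, p))  # item, difficulties, test
    return results
```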
Table 2 shows the results of the DIF analysis with the Mantel-Haenszel procedure, which uses total scores as a matching criterion. When the two groups were matched on total scores, 13 of the 17 items in the test showed DIF, with 6 items (1, 2, 6, 10, 12, and 13) differentially easier for males and 7 items (4, 5, 7, 8, 9, 11, and 17) differentially easier for females. These findings suggest that item difficulty statistics should be interpreted with caution because DIF can be present beyond the item difficulty indices (Thissen et al., 1986). Even though the English listening test of the KCSAT had as many as 13 DIF items, the numbers of DIF items favoring males and females were almost equal, indicating that DIF items might cancel each other out in a test-level analysis (Drasgow, 1987; Takala & Kaftandjieva, 2000).
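The paper reports only which items flagged, but a common convention for acting on MH results is the ETS delta scale of Holland and Thayer (1988), where MH D-DIF = −2.35 ln(α). The simplified A/B/C thresholds below follow that widely used convention; they are illustrative and are not taken from this study.

```python
import math

def ets_category(alpha, significant):
    """Simplified ETS A/B/C flagging given the MH common odds ratio and
    whether the MH chi-square rejected H0: alpha = 1."""
    d = -2.35 * math.log(alpha)  # positive values favor the focal group
    if not significant or abs(d) < 1.0:
        return "A"               # negligible DIF
    if abs(d) >= 1.5:
        return "C"               # large DIF: review, revise, or replace
    return "B"                   # intermediate DIF
```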
The items showing DIF were further analyzed to determine whether language type (dialogue and monologue), question type (local, global, and expression), picture presence, and content were associated with DIF. Table 3 reports that, in general, the DIF items were related to language type and picture presence.
TABLE 2 Identification of DIF After Matching Ability Levels
** Item favored the focal group (females); * item favored the reference group (males).
This relationship, however, was different from previous findings, which showed that picture items were easier for females (Pae, 2002). More specifically, four items (1, 2, 12, and 13) favored males and six items (4, 5, 7, 8, 9, and 11) favored females in the dialogues, whereas two items (6 and 10) favored males and one item (17) favored females in the monologues. It is interesting that both of the picture items (1 and 13) favored males. In question type, however, gender differences in the number of DIF items were not found. In the local questions asking for factual information, one item (1) was easier for males, whereas two items (5 and 8) were easier for females. In the global questions asking for inferential information, four items (2, 6, 10, and 12) were easier for males and four items (4, 7, 9, and 11) were easier for females. In the expression questions asking for appropriate expression, one item favored males (13) and one favored females (17). Interpreting DIF in relation to content was difficult because the English listening test covered such a wide variety of content. The coverage was so broad because the test makers sought to minimize the examinees' background knowledge effects on the test and to maximize content validity (Chiang & Dunkel, 1992; Park, 2004). As Roznowski (1987) reported, however, shopping (4 and 8) and theater (3) contents favored females, whereas sports (12) and travel (2) contents favored males. However, in this study farming (15) and health (14) did not show DIF, in contrast with Roznowski's study.
TABLE 3 Analysis of DIF by Language Type, Question Type, Content, and Picture Presence

Item  Language type  Question type  Content                                          Picture presence  DIF status
2     Dialogue       Global         Describing views from a mountain                 No                Male
3     Dialogue       Local          Planning for the weekend: Going to the theater   No                Female
4     Dialogue       Global         Discussing a customer complaint                  No                Female
5     Dialogue       Local          Asking a person to record a TV program           No                Female
6     Monologue      Global         Advising students to behave properly             No                Male
9     Dialogue       Global         Exchanging information about a city              No                Female
10    Monologue      Global         Asking citizens to help in a festival            No                Male
11    Dialogue       Global         Discussing problems in class                     No                Female
12    Dialogue       Global         Exchanging information about sports              No                Male
13    Dialogue       Expression     Catching a dog running away                      Picture           Male
14    Dialogue       Expression     Exchanging information about a health club       No                No DIF
17    Monologue      Expression     Clearing snow off the sidewalks                  No                Female
This study used the Mantel-Haenszel procedure to investigate whether test items were invariant across gender in the English listening test of the 2003 KCSAT. Of the 17 items on the listening test, 13 items displayed DIF: 6 items favored males, and 7 items favored females. In a closer investigation, the four important variables considered in developing the test (content about text, picture presence in the items, language type, and question type) were all associated with DIF to different degrees.
The findings of this study have several implications. First, test items should be pretested for any problems in psychometric properties, including DIF, before they are used. If any items show DIF, the items should be revised or eliminated after thoughtful evaluation by the selection committee or bias reviewers. It is important to note that even though a subtest shows almost equal numbers of DIF items for each group, say 6 items favoring one group and 7 items favoring the other group, the result can be consequential for examinees at the total test score level (see Maller, 2001). This problem arises when a raw score difference between the focal group and the reference group in a subtest is accumulated in the total test score. In this scenario, the total raw score difference can be substantial, leading to unfairness of the test across groups.
Second, when test items cannot be pretested for security reasons, as with the KCSAT, the selection committee should carefully choose items free from possible bias across groups by considering many variables such as content, picture presence, language type, and question type. As discussed earlier, the shopping content in the listening test of the 2003 KCSAT favored females, whereas the items with pictures favored males. In this case, if item developers combine shopping content and picture presence in a single item, the item may show minimal DIF or may not flag for DIF at all.
Specific care needs to be taken with content because different content favors different groups, and the exclusion of content from a test can cause problems in validity. For instance, if the contents of shopping and sports are excluded from a listening test, the test may be free from DIF, as seen in this study. However, the test may also suffer from a lack of content validity because it fails to cover the universe of items. To tackle these intricate problems, the committee can choose items with various contents about text which may flag for DIF but cancel each other out in the test as a whole (Clauser & Mazor, 1998). What should be noted, however, is that studies of DIF to date have not shown whether accumulated DIF items cancel each other out in test-level analysis.

Third, item developers should assume professional responsibility for developing items that are as fair as possible by considering as many variables as possible (Carlton & Harris, 1992). Some may argue that it is practically impossible to consider all these variables in developing test items. However, considering the personal and social ramifications of high-stakes tests, every effort should be made to develop items that are free from bias.
This study suggests the following future inquiries. First, this study identified DIF based on statistical analyses; a logical next step is to explore whether bias reviewers can identify test items showing DIF without statistical data (Engelhard et al., 1990). Second, it should be investigated whether items showing DIF in a test manifest differential test functioning (DTF). Even though several empirical studies on DTF have been undertaken (Takala & Kaftandjieva, 2000; Zumbo, 2003), we are not yet sure whether (a) a test with DIF items shows DTF because of DIF accumulation in the test-level analysis, (b) a test with DIF items shows no DTF because of DIF cancellation in the test-level analysis, or (c) a test with DIF items shows no DTF because DIF is independent of DTF.
ACKNOWLEDGMENTS
I express my deep gratitude to the anonymous TESOL Quarterly reviewers for their
insightful comments on an earlier draft of this study.
THE AUTHOR
Gi-Pyo Park is a professor of teaching English as a foreign language at Soonchunhyang University in Seoul, South Korea. His research interests include testing, learning strategies, and listening and reading comprehension.
REFERENCES
Carlton, S., & Harris, A. (1992). Characteristics associated with differential item functioning on the Scholastic Aptitude Test: Gender and majority/minority group comparisons (Research Report No. 92, pp. 60–70). Princeton, NJ: ETS.
Chiang, C., & Dunkel, P. (1992). The effect of speech modification, prior knowledge, and listening proficiency on EFL lecture learning. TESOL Quarterly, 26, 345–374.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31–47.
Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests. The Journal of Applied Psychology, 72, 19–29.
Elder, C. (1997). What does test bias have to do with fairness? Language Testing, 14, 261–277.
Ellis, B., & Raju, N. (2003). Test and item bias: What they are, what they aren't, and how to detect them. Washington, DC: U.S. Department of Education. (ERIC Document Reproduction Service No. ED 480042)
Engelhard, G., Hansche, L., & Rutledge, K. (1990). Accuracy of bias review judges in identifying differential item functioning on teacher certification tests. Applied Measurement in Education, 3, 347–360.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Thousand Oaks, CA: Sage.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum.
Maller, S. (2001). Differential item functioning in the WISC-III: Item parameters for boys and girls in the national standardization sample. Educational and Psychological Measurement, 61, 793–817.
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.
Nunan, D. (1991). Language teaching methodology: A textbook for teachers. New York: Prentice-Hall.
Pae, T.-I. (2002). Gender differential item functioning on a national language test. Unpublished doctoral dissertation, Purdue University, West Lafayette, Indiana, United States.
Park, G.-P. (2004). Comparison of L2 listening and reading comprehension by university students learning English in Korea. Foreign Language Annals, 37, 448–458.
Reckase, M. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207–230.
Roznowski, M. (1987). Use of tests manifesting sex differences as measures of intelligence: Implications for measurement bias. The Journal of Applied Psychology, 72, 480–483.
Ryan, K., & Bachman, L. F. (1992). Differential item functioning on two tests of EFL proficiency. Language Testing, 9, 12–29.
Shohamy, E., & Inbar, O. (1991). Validation of listening comprehension tests: The effect of text and question type. Language Testing, 8, 23–40.
Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary test. Language Testing, 17, 323–340.
Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group mean differences: The concept of item bias. Psychological Bulletin, 99, 118–128.
2002 TOEIC through data. (2002). TOEIC Newsletter, 17, 2–7.
Zumbo, B. (2003). Does item-level DIF manifest itself in scale-level analysis? Implications for translating language tests. Language Testing, 20, 136–147.