BRIEF REPORTS AND SUMMARIES
TESOL Quarterly invites readers to submit short reports and updates on their work. These summaries may address any areas of interest to Quarterly readers.
Edited by JOHN FLOWERDEW
University of Leeds
ALI SHEHADEH
United Arab Emirates University
Differential Item Functioning on an English
Listening Test Across Gender
GI-PYO PARK
Soonchunhyang University
Seoul, South Korea
Differential item functioning (DIF) is present when two groups of equal ability show a differential probability of a correct response to an item (Ellis & Raju, 2003). Two groups refers to the focal group, or the group of primary interest, and the reference group, or a standard group against which the focal group is compared. Equal ability in group comparison is important because it tells whether group differences in performance arise from true differences in ability or from item bias (Elder, 1997). Differential probability concerns item difficulty, or uniform DIF, and item discrimination, or nonuniform DIF. Studies on DIF are crucial because DIF deals with the fairness of items across groups, say gender and socioeconomic status, beyond group-mean differences (Thissen, Steinberg, & Gerrard, 1986).
Several studies on DIF have investigated whether gender differences in test performance resulted from gender bias (Drasgow, 1987; Ryan & Bachman, 1992; Takala & Kaftandjieva, 2000). Using item response theory, Drasgow investigated whether the ACT Mathematics Usage Test, which consisted of 40 items, functioned differentially across gender and race. He found that five items showed evidence of DIF. However, test characteristic curves, which are the sum of the item characteristic curves, identified no group differences in the cumulative effects of DIF in the test as a whole.
Ryan and Bachman (1992) detected DIF in the Test of English as a Foreign Language (TOEFL) and in the First Certificate of English (FCE) using the Mantel-Haenszel procedure across gender and language background (Indo-European/non-Indo-European). In terms of gender, four of the 140 TOEFL items favored males, and two TOEFL items favored females; in the FCE, one of the 38 items favored males, and one item favored females. For language background, 32 TOEFL items were easier for Indo-European native speakers, and 33 TOEFL items were easier for non-Indo-European native speakers; in the FCE, 13 items were easier for Indo-European native speakers, and 12 were easier for non-Indo-European native speakers.
Takala and Kaftandjieva (2000) examined whether the vocabulary subtest of the Finnish Foreign Language Certificate Examination (FFLCE) showed evidence of DIF. The FFLCE had 11 items showing DIF, with six items favoring males and five items favoring females. Despite these findings, however, excluding the DIF items from the test did not affect the ability parameter estimations for males and females, probably because the DIF items canceled each other out.
Previous studies on DIF across gender have identified DIF using various methods such as the Mantel-Haenszel procedure, item response theory, and confirmatory factor analysis. However, few have further analyzed the sources of DIF in terms of variables such as language type (dialogue and monologue), question type (local, global, and expression), content about text, and picture presence in the item (Engelhard, Hansche, & Rutledge, 1990; Pae, 2002; Roznowski, 1987).
The purpose of this study was to identify DIF across gender using the Mantel-Haenszel procedure in the English listening part of the 2003 Korea College Scholastic Ability Test (KCSAT). Another purpose of this study was to further articulate the sources of DIF in relation to four important variables considered when the English listening test was developed: language type, question type, content about text, and picture presence. Two research questions guide this study: (a) Does the English listening test in the 2003 KCSAT include items displaying DIF across gender? (b) If so, what are the sources of DIF? The answers to these questions will sensitize item developers to the issue of DIF, which will, in turn, help them to develop items free from bias based on gender, socioeconomic status, and other factors.
METHOD
Participants
The participants were 20,000 males (half in liberal arts and half in science) and 20,000 females (half in liberal arts and half in science) chosen from the 675,922 examinees who took the 2003 KCSAT. With the cooperation of the Korea Institute of Curriculum and Evaluation (KICE), the participants were chosen based solely on gender and academic background (liberal arts versus sciences) to minimize confounding variables.
The Mantel-Haenszel Procedure
The Mantel-Haenszel procedure was chosen to detect DIF because the procedure is easy to use and because it is widely accepted as a measure of DIF. The procedure was introduced by Mantel and Haenszel (1959), and adapted by Holland and Thayer (1988), to identify items displaying DIF across members of different subgroups. Using the Mantel-Haenszel χ² statistic with one degree of freedom, the procedure tests the null hypothesis H₀: α = 1, where α is the common odds ratio.
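To make the computation concrete, the following minimal sketch estimates the common odds ratio and the Mantel-Haenszel χ² for a single item by stratifying examinees on their total score. It is an illustration only: Python and NumPy, and the function and variable names, are assumptions of this sketch, not part of the KCSAT analysis.

```python
import numpy as np

def mantel_haenszel_dif(item_correct, is_focal, total_score):
    """Estimate the MH common odds ratio (alpha) and the MH chi-square
    statistic (1 df, with continuity correction) for one item."""
    num = den = 0.0              # numerator/denominator of the odds ratio
    obs_a = exp_a = var_a = 0.0  # observed count, expectation, and variance
    for k in np.unique(total_score):
        stratum = total_score == k
        # 2 x 2 table at score level k (reference group in the top row)
        a = np.sum(stratum & ~is_focal & (item_correct == 1))
        b = np.sum(stratum & ~is_focal & (item_correct == 0))
        c = np.sum(stratum & is_focal & (item_correct == 1))
        d = np.sum(stratum & is_focal & (item_correct == 0))
        n = a + b + c + d
        if n < 2:
            continue  # a one-person stratum carries no comparative information
        num += a * d / n
        den += b * c / n
        obs_a += a
        exp_a += (a + b) * (a + c) / n
        var_a += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    alpha = num / den                               # H0: alpha = 1 (no DIF)
    chi2 = (abs(obs_a - exp_a) - 0.5) ** 2 / var_a  # compare to chi-square(1)
    return alpha, chi2
```

Under this setup, α greater than 1 means the item was relatively easier for the reference group (here, males), and α less than 1 means it was relatively easier for the focal group (females), after matching on total score.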
Instrument
The English listening test from the 2003 KCSAT was used as the instrument because the KCSAT is a high-stakes test that fits the study of DIF; the KCSAT plays a critical role in deciding admission to college in Korea. The English listening test was developed in about a month by a special testing committee appointed by the KICE. The committee consisted of English professors and teachers who had expertise in developing test items. KICE asked them to develop various items, specifically considering language type (dialogue and monologue), question type (local, global, and expression), picture presence, and content (Nunan, 1991; Shohamy & Inbar, 1991). The draft of the test was reviewed twice by two different review committees consisting of high school teachers. The review committees were asked to estimate item difficulty and to screen out any items that were similar to items already used elsewhere. Pretesting to investigate the psychometric properties of the test was not possible for security reasons.
The English listening test was in multiple-choice format, consisting of 17 items varying in language type, question type, picture presence, and content. In terms of language type, the test included 14 items based on dialogues and three items based on monologues. For question type, it included eight global questions asking for inferential information, four local questions asking for factual information, and five questions asking about appropriate expression. With regard to picture presence, the test consisted of two picture items and 15 nonpicture items. In addition, the English listening test covered a variety of content such as exchanging information on sports, a health club, and a city; discussing a customer complaint and problems in class; describing views; visiting a patient; asking a person to record a TV program; asking citizens to help in a festival; advising students to behave properly; planning for the weekend; going out for dinner; and identifying a person in a picture.
Each text about dialogues and monologues in the test comprised about 85 words, with 9–11 turns between speakers in dialogues. The text was recorded by two native speakers of English, one male and one female, at a speed of about 140 words per minute. The reliability of the test, as measured by Cronbach's α, was 0.802.
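As a point of reference, Cronbach's α can be computed directly from the scored response matrix. The sketch below assumes a NumPy array of 0/1 scores with rows for examinees and columns for the 17 items; the names are illustrative, not from the study.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of item scores (examinees x items)."""
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)
```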
FINDINGS
Before identifying DIF across gender with the Mantel-Haenszel procedure in the English listening test, the dimensionality and item difficulty of the test were investigated. Dimensionality, or the number of latent dimensions, was investigated by principal component analysis (see Table 1). The listening test was either unidimensional (the test measured only one ability), determined by the variance of the first factor, 24.60%, or multidimensional (the test measured more than one ability), determined by the eigenvalue of the first factor, 4.18 (Reckase, 1979).¹ This result indicated that the Mantel-Haenszel procedure, rather than item response theory, which assumes unidimensionality, should be used to investigate whether the test functioned differently across gender with these data (Hambleton, Swaminathan, & Rogers, 1991).
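A minimal sketch of this dimensionality check follows, assuming the same 0/1 response matrix as above; it extracts the eigenvalues of the inter-item correlation matrix. With 17 items the trace of that matrix is 17, so a first eigenvalue of 4.18 corresponds to the reported 24.60% of total variance (4.18 / 17 ≈ 0.246).

```python
import numpy as np

def first_component_strength(scores):
    """Largest eigenvalue of the inter-item correlation matrix and the
    percentage of total variance it explains."""
    corr = np.corrcoef(scores, rowvar=False)  # treat items as variables
    eigenvalues = np.linalg.eigvalsh(corr)    # returned in ascending order
    first = eigenvalues[-1]
    pct = 100.0 * first / eigenvalues.sum()   # trace equals the item count
    return first, pct

# Reckase's (1979) criterion: the first factor dominates parameter
# estimation if its eigenvalue is at least 10 or it accounts for at
# least 20% of the total variance.
```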
To provide a rationale for studying DIF, item difficulty statistics were calculated before matching ability levels, followed by t tests for males and females. Only two items (2 and 12) of the 17 items in the listening test were significantly easier for males, whereas 13 items (3–9, 11, 13, 14–17) were significantly easier for females. This finding indicates that the female participants had better foreign language listening ability than the male participants (Ryan & Bachman, 1992; 2002 TOEIC Through Data, 2002).
¹ Reckase (1979) argued that for the first factor to control the estimation of the parameters, it should have an eigenvalue of 10 or greater or account for at least 20% of the total variance.
TABLE 1 Results of Principal Component Analysis
However, because these statistics were calculated before ability levels were matched between the two groups, the results could not tell whether the group differences in item difficulty resulted from true group differences in ability or item bias (Elder, 1997). Thus, after ability levels were matched, differential item functioning for the two groups in the test was investigated in depth.
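The pre-matching comparison can be reproduced with classical difficulty indices (proportion correct per group) and an independent-samples t test per item. The sketch below assumes the same response matrix and a boolean gender indicator, and uses SciPy for the t test; none of these names come from the study.

```python
import numpy as np
from scipy import stats  # SciPy is an assumed dependency of this sketch

def difficulty_and_t_tests(scores, is_female):
    """Per-item proportion correct for each group, with a t test of the
    group difference computed before any ability matching."""
    p_male = scores[~is_female].mean(axis=0)
    p_female = scores[is_female].mean(axis=0)
    results = []
    for j in range(scores.shape[1]):
        t, p = stats.ttest_ind(scores[is_female, j], scores[~is_female, j])
        results.append((j + 1, p_male[j], p_female[j], t, p))  # item, difficulties, test
    return results
```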
Table 2 shows the results of the DIF analysis with the Mantel-Haenszel procedure, which uses total scores as a matching criterion. When the two groups were matched on total scores, 13 of the 17 items in the test showed DIF, with 6 items (1, 2, 6, 10, 12, and 13) differentially easier for males and 7 items (4, 5, 7, 8, 9, 11, and 17) differentially easier for females. These findings suggest that item difficulty statistics should be interpreted with caution because DIF can be present beyond the item difficulty indices (Thissen et al., 1986). Even though the English listening test of the KCSAT had as many as 13 DIF items, the numbers of DIF items favoring males and females were almost equal, indicating that DIF items might cancel each other out in a test-level analysis (Drasgow, 1987; Takala & Kaftandjieva, 2000).
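The paper reports only which items flagged, but a common convention for acting on MH results is the ETS delta scale of Holland and Thayer (1988), where MH D-DIF = −2.35 ln(α). The simplified A/B/C thresholds below follow that widely used convention; they are illustrative and are not taken from this study.

```python
import math

def ets_category(alpha, significant):
    """Simplified ETS A/B/C flagging given the MH common odds ratio and
    whether the MH chi-square rejected H0: alpha = 1."""
    d = -2.35 * math.log(alpha)  # positive values favor the focal group
    if not significant or abs(d) < 1.0:
        return "A"               # negligible DIF
    if abs(d) >= 1.5:
        return "C"               # large DIF: review, revise, or replace
    return "B"                   # intermediate DIF
```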
The items showing DIF were further analyzed to determine whether language type (dialogue and monologue), question type (local, global, and expression), picture presence, and content were associated with DIF. Table 3 reports that, in general, the DIF items were related to language type and picture presence.
TABLE 2 Identification of DIF After Matching Ability Levels
** Item favored the focal group (females); * item favored the reference group (males).
This relationship, however, was different from previous findings, which showed that picture items were easier for females (Pae, 2002). More specifically, four items (1, 2, 12, and 13) favored males and six items (4, 5, 7, 8, 9, and 11) favored females in the dialogues, whereas two items (6 and 10) favored males and one item (17) favored females in the monologues. It is interesting that both of the picture items (1 and 13) favored males. In question type, however, gender differences in the number of DIF items were not found. In the local questions asking for factual information, one item (1) was easier for males, whereas two items (5 and 8) were easier for females. In the global questions asking for inferential information, four items (2, 6, 10, and 12) were easier for males and four items (4, 7, 9, and 11) were easier for females. In the expression questions asking for appropriate expression, one item favored males (13) and one favored females (17). Interpreting DIF in relation to content was difficult because the English listening test covered such a wide variety of content. The coverage was so broad because the test makers sought to minimize the examinees' background knowledge effects on the test and to maximize content validity (Chiang & Dunkel, 1992; Park, 2004). As Roznowski (1987) reported, however, shopping (4 and 8) and theater (3) contents favored females, whereas sports (12) and travel (2) contents favored males. However, in this study farming (15) and health (14) did not show DIF, in contrast with Roznowski's study.
TABLE 3 Analysis of DIF by Language Type, Question Type, Content, and Picture Presence

Item  Language type  Question type  Content                                          Picture presence  DIF status
2     Dialogue       Global         Describing views from a mountain                 No                Male
3     Dialogue       Local          Planning for the weekend: Going to the theater   No                Female
4     Dialogue       Global         Discussing a customer complaint                  No                Female
5     Dialogue       Local          Asking a person to record a TV program           No                Female
6     Monologue      Global         Advising students to behave properly             No                Male
9     Dialogue       Global         Exchanging information about a city              No                Female
10    Monologue      Global         Asking citizens to help in a festival            No                Male
11    Dialogue       Global         Discussing problems in class                     No                Female
12    Dialogue       Global         Exchanging information about sports              No                Male
13    Dialogue       Expression     Catching a dog running away                      Picture           Male
14    Dialogue       Expression     Exchanging information about a health club       No                No DIF
17    Monologue      Expression     Clearing snow off the sidewalks                  No                Female
This study used the Mantel-Haenszel procedure to investigate whether test items were invariant across gender in the English listening test of the 2003 KCSAT. Of the 17 items on the listening test, 13 items displayed DIF: 6 items favored males, and 7 items favored females. In a closer investigation, the four important variables considered in developing the test (content about text, picture presence in the items, language type, and question type) were all associated with DIF to different degrees.
The findings of this study have several implications. First, test items should be pretested for any problems in psychometric properties, including DIF, before they are used. If any items show DIF, the items should be revised or eliminated after thoughtful evaluation by the selection committee or bias reviewers. It is important to note that even though a subtest shows almost equal numbers of DIF items for each group, say 6 items favoring one group and 7 items favoring the other group, the result can be consequential for examinees at the total test score level (see Maller, 2001). This problem arises when a raw score difference between the focal group and the reference group in a subtest is accumulated in the total test score. In this scenario, the total raw score difference can be substantial, leading to unfairness of the test across groups.
Second, when test items cannot be pretested for security reasons, as with the KCSAT, the selection committee should carefully choose items free from possible bias across groups by considering many variables such as content, picture presence, language type, and question type. As discussed earlier, the shopping content in the listening test of the 2003 KCSAT favored females, whereas the items with pictures favored males. In this case, if item developers combine shopping content and picture presence in a single item, the item may show minimal DIF or may not flag for DIF at all.
Specific care needs to be taken with content because different content favors different groups, and the exclusion of content from a test can cause problems in validity. For instance, if the contents of shopping and sports are excluded from a listening test, the test may be free from DIF, as seen in this study. However, the test may also suffer from a lack of content validity because it fails to cover the universe of items. To tackle these intricate problems, the committee can choose items with various contents about text which may flag for DIF but cancel each other out in the test as a whole (Clauser & Mazor, 1998). What should be noted, however, is that studies of DIF to date have not shown whether accumulated DIF items cancel each other out in test-level analysis.

Third, item developers should assume professional responsibility for developing items that are as fair as possible by considering as many variables as possible (Carlton & Harris, 1992). Some may argue that it is practically impossible to consider all these variables in developing test items. However, considering the personal and social ramifications of high-stakes tests, every effort should be made to develop items that are free from bias.
This study suggests the following future inquiries. First, this study identified DIF based on statistical analyses; a logical next step is to explore whether bias reviewers can identify test items showing DIF without statistical data (Engelhard et al., 1990). Second, it should be investigated whether items showing DIF in a test manifest differential test functioning (DTF). Even though several empirical studies on DTF have been undertaken (Takala & Kaftandjieva, 2000; Zumbo, 2003), we are not yet sure whether (a) a test with DIF items shows DTF because of DIF accumulation in the test-level analysis, (b) a test with DIF items shows no DTF because of DIF cancellation in the test-level analysis, or (c) a test with DIF items shows no DTF because DIF is independent of DTF.
ACKNOWLEDGMENTS
I express my deep gratitude to the anonymous TESOL Quarterly reviewers for their
insightful comments on an earlier draft of this study.
THE AUTHOR
Gi-Pyo Park is a professor of teaching English as a foreign language at Soonchunhyang University in Seoul, South Korea. His research interests include testing, learning strategies, and listening and reading comprehension.
REFERENCES
Carlton, S., & Harris, A. (1992). Characteristics associated with differential item functioning on the Scholastic Aptitude Test: Gender and majority/minority group comparisons (Research Report No. 92, pp. 60–70). Princeton, NJ: ETS.
Chiang, C., & Dunkel, P. (1992). The effect of speech modification, prior knowledge, and listening proficiency on EFL lecture learning. TESOL Quarterly, 26, 345–374.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31–47.
Drasgow, F. (1987). Study of the measurement bias of two standardized psychological tests. The Journal of Applied Psychology, 72, 19–29.
Elder, C. (1997). What does test bias have to do with fairness? Language Testing, 14, 261–277.
Ellis, B., & Raju, N. (2003). Test and item bias: What they are, what they aren't, and how to detect them. Washington, DC: U.S. Department of Education. (ERIC Document Reproduction Service No. ED 480042)
Engelhard, G., Hansche, L., & Rutledge, K. (1990). Accuracy of bias review judges in identifying differential item functioning on teacher certification tests. Applied Measurement in Education, 3, 347–360.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Thousand Oaks, CA: Sage.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum.
Maller, S. (2001). Differential item functioning in the WISC-III: Item parameters for boys and girls in the national standardization sample. Educational and Psychological Measurement, 61, 793–817.
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748.
Nunan, D. (1991). Language teaching methodology: A textbook for teachers. New York: Prentice-Hall.
Pae, T.-I. (2002). Gender differential item functioning on a national language test. Unpublished doctoral dissertation, Purdue University, West Lafayette, Indiana, United States.
Park, G.-P. (2004). Comparison of L2 listening and reading comprehension by university students learning English in Korea. Foreign Language Annals, 37, 448–458.
Reckase, M. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207–230.
Roznowski, M. (1987). Use of tests manifesting sex differences as measures of intelligence: Implications for measurement bias. The Journal of Applied Psychology, 72, 480–483.
Ryan, K., & Bachman, L. F. (1992). Differential item functioning on two tests of EFL proficiency. Language Testing, 9, 12–29.
Shohamy, E., & Inbar, O. (1991). Validation of listening comprehension tests: The effect of text and question type. Language Testing, 8, 23–40.
Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary test. Language Testing, 17, 323–340.
Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group mean differences: The concept of item bias. Psychological Bulletin, 99, 118–128.
2002 TOEIC through data. (2002). TOEIC Newsletter, 17, 2–7.
Zumbo, B. (2003). Does item-level DIF manifest itself in scale-level analysis? Implications for translating language tests. Language Testing, 20, 136–147.