Test evaluation is a complex undertaking that has attracted the attention of many researchers since the importance of language tests in assessing students' achievement was first raised. When evaluating a test, evaluators should focus on the criteria of a good test, of which validity and reliability are two important factors. In the current study, the researcher chose the English 3B end-of-term reading test for second year mainstream students at FELTE, ULIS, VNU in the school year 2013-2014 to evaluate, with the aim of checking the content validity and structure validity as well as estimating the internal reliability of the test. From the interpretation of the data obtained from test scores, survey questionnaires, and test specifications analysis, the researcher found that the English 3B end-of-term reading test is reliable in terms of internal reliability. The content validity was also checked, and the test was concluded to demonstrate a relatively high level of content relevance and some evidence of representativeness. Besides, it also proved to show structure validity to some extent. However, the study retains limitations that point to the researcher's directions for future studies.
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES FACULTY OF ENGLISH LANGUAGE TEACHER EDUCATION
GRADUATION PAPER
AN EVALUATION OF SOME ASPECTS OF THE VALIDITY OF A READING ACHIEVEMENT TEST (THE 3B END-OF-TERM READING TEST) FOR SECOND YEAR MAINSTREAM STUDENTS
IN THE SCHOOL YEAR 2013 – 2014 AT FELTE, ULIS, VNU
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
FACULTY OF ENGLISH LANGUAGE TEACHER EDUCATION

GRADUATION PAPER

AN EVALUATION OF SOME ASPECTS OF THE VALIDITY OF AN END-OF-TERM READING TEST (THE ENGLISH 3B END-OF-TERM READING TEST) IN THE SCHOOL YEAR 2013-2014 FOR SECOND YEAR MAINSTREAM STUDENTS AT THE FACULTY OF ENGLISH LANGUAGE TEACHER EDUCATION, UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES, VIETNAM NATIONAL UNIVERSITY, HANOI

Supervisor: Dr. Dương Thu Mai
Student: Vũ Thị Hương
Cohort: QH2010

HANOI – 2014
I hereby state that I, Vũ Thị Hương, QH2010.F.1.E1, being a candidate for the degree of Bachelor of Arts (TEFL), accept the requirements of the College relating to the retention and use of Bachelor's Graduation Papers deposited in the library.

In terms of these conditions, I agree that the origin of my paper deposited in the library should be accessible for the purposes of study and research, in accordance with the normal conditions established by the librarian for the care, loan or reproduction of the paper.

Signature

Vũ Thị Hương
May 2014
I wish to take this opportunity to express my sincere thanks to Ms. Trần Thị Lan Anh, my ELT teacher, for her insightful comments and suggestions for this paper.

I also owe my sincere thanks to the teachers at the Division of English Skills 2, FELTE, ULIS, VNU, who were enthusiastic participants in my research. Without them, my research could not have been completed successfully.

I would like to send my thanks to my teachers, friends and classmates for their sincere comments and criticism as well as their encouragement.

Finally, I would like to express my deep gratitude to my family, especially my mother, who has constantly inspired and encouraged me to overcome difficulties and complete this study.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES AND TABLES
LIST OF ABBREVIATIONS
PART I: INTRODUCTION
1. Statement of research problem and rationale for the study
2. Goals and objectives of the study
3. Research questions
4. Significance of the study
5. Scope of the study
6. Methodology of the study
7. The organization of the study
PART II: DEVELOPMENT
CHAPTER 1: LITERATURE REVIEW
1. Key concepts
1.1. Assessment, measurement, test and evaluation
1.1.1. Assessment
1.1.2. Measurement
1.1.3. Test
1.1.4. Evaluation
1.2. Test purposes
1.3. Types of test items
1.3.1. Objective test items
1.3.2. Subjective test items
1.4. English reading achievement tests
1.4.1. Definition of reading
1.4.2. The construct of reading performance in English and reading achievement tests
1.5. The technical quality of reading tests
1.5.1. Overview of all qualities
1.5.2. Validity
1.5.2.1. Definition of validity
1.5.2.2. Aspects of validity
1.5.2.3. Factors affecting the validity of reading tests
1.5.3. Reliability
1.5.3.1. Definition of reliability
1.5.3.2. Types of reliability
2. Review of related studies on the validity of reading tests
2.1. Studies on the validity of reading tests worldwide
2.2. Studies on the validity of reading tests in Vietnam
CHAPTER 2: METHODOLOGY
1. The reading assessment context for second year mainstream students at FELTE, ULIS, VNU
1.1. Test administration procedure
1.2. Test specifications
2. Research questions
3. Research participants and the selection of participants
4. Data collection methods
4.1. Survey
4.2. Document observation
5. Data collection procedure
5.1. Survey questionnaire
5.2. Document observation
6. Data analysis and procedure
CHAPTER 3: FINDINGS AND DISCUSSION
1. Data analysis and results
1.1. Research question 1: The content validity of the test as perceived by teachers
1.2. Research question 2: The structure validity of the test
1.3. Research question 3: The internal reliability of the test
2. Findings and discussion
2.1. Major findings
2.2. Content validity of the test
2.3. Structure validity of the test
2.4. Reliability of the test
PART III: CONCLUSION
1. Conclusion and implications
2. Limitations of the study
3. Suggestions for further research
REFERENCES
Appendix 1: The 3B end-of-term reading test for second year mainstream students at FELTE, ULIS, VNU in the school year 2013-2014
Appendix 2: Survey questionnaire for teachers
Appendix 3: Internal reliability calculating results
LIST OF FIGURES AND TABLES

FIGURES
Figure 1.1: Relation between evaluation, tests and measurements (Bachman, 1990)
Figure 2.1: Bloom's Taxonomy (old version)
Figure 2.2: Bloom's Taxonomy (revised version)
Figure 2.3: Changes in the old and new versions of Bloom's taxonomy

TABLES
Table 1.1: Types of reliability
Table 1.2: Range of internal reliability values in Cronbach's Alpha
Table 2.1: Test specifications
Table 2.2: Revised test specifications
Table 3.1: Teachers' opinions about the tested skills in questions 71-75
Table 3.2: Teachers' opinions about the tested skills in questions 76-80
Table 3.3: Teachers' opinions about the tested skills in question 81
Table 3.4: Teachers' opinions about the tested skills in question 82
Table 3.5: Teachers' opinions about the tested skills in questions 83-84
Table 3.6: Teachers' opinions about the tested skills in question 85
Table 3.7: Teachers' opinions about the tested skills in questions 86-89
Table 3.8: Teachers' opinions about the tested skills in questions 90-93
Table 3.9: Teachers' opinions about the tested skills in questions 94-99
Table 3.10: Teachers' opinions about the tested skills in questions 100-101
Table 3.11: Teachers' opinions about the tested skills in question 102
Table 3.12: Teachers' opinions about the tested skills in question 103
Table 3.13: Teachers' opinions about the tested skills in questions 104-105
Table 3.14: Teachers' opinions about the tested skills in question 106
Table 3.15: Teachers' opinions about the tested skills in questions 107-110
Table 3.16: Teachers' opinions about the difficulty level of questions 71-75
Table 3.17: Teachers' opinions about the difficulty level of questions 76-80
Table 3.18: Teachers' opinions about the difficulty level of question 81
Table 3.19: Teachers' opinions about the difficulty level of question 82
Table 3.20: Teachers' opinions about the difficulty level of questions 83-84
Table 3.21: Teachers' opinions about the difficulty level of question 85
Table 3.22: Teachers' opinions about the difficulty level of questions 86-89
Table 3.23: Teachers' opinions about the difficulty level of questions 90-93
Table 3.24: Teachers' opinions about the difficulty level of questions 94-99
Table 3.25: Teachers' opinions about the difficulty level of questions 100-101
Table 3.26: Teachers' opinions about the difficulty level of question 102
Table 3.27: Teachers' opinions about the difficulty level of question 103
Table 3.28: Teachers' opinions about the difficulty level of questions 104-105
Table 3.29: Teachers' opinions about the difficulty level of question 106
Table 3.30: Teachers' opinions about the difficulty level of questions 107-110
Table 3.31: The distribution of the skills tested in the test specifications and the course guide
Table 3.32: Internal reliability statistics of the test
LIST OF ABBREVIATIONS
CAE: The Certificate in Advanced English
ELT: English Language Teaching
FCE: The First Certificate in English
FELTE: Faculty of English Language Teacher Education
IELTS: International English Language Testing System
PET: The Preliminary English Test
TOEFL: Test of English as a Foreign Language
ULIS: University of Languages and International Studies
VNU: Vietnam National University, Hanoi
PART I: INTRODUCTION
1. Statement of research problem and rationale for the study
Assessment plays an important role in the process of teaching and learning. It provides teachers with "the information that is used for making decisions about students, curricula and programs, and educational policy" (Nitko, 1996). That kind of information can be collected through a number of tools such as tests, students' diaries, portfolios, etc. Among these, testing is an important tool for educational assessment in general and for language assessment in particular. According to Heaton (1988), testing can be used as a method to strengthen the relationship between teaching and learning, to motivate students, or simply to assess the language performance of students. Testing has a close relationship with teaching; the two are "so closely interrelated that it is virtually impossible to work in either field without being constantly concerned with the other" (Heaton, 1988). Testing is a tool to "pinpoint strengths and weaknesses in the learned abilities of the student" (Henning, 1987, p.1). That is, thanks to testing, teachers can evaluate both their students' abilities in general and language ability in particular, as well as the teaching and learning process. At the same time, learners can self-evaluate their ability through testing. Thus, Read (1997) states that "a test can help both teachers and learners to clarify what the learners really need to know." Clearly, not only teachers but also learners may benefit from testing. That is the reason why testing is implemented in schools at different levels in general and at the University of Languages and International Studies, Vietnam National University, Hanoi (ULIS, VNU) in particular.
In spite of its crucial importance, designing a test is not easy work. Sometimes the content of a test may be suitable for one type of learner at one level but not for other learners at different levels. Thus, learners' abilities may be evaluated inappropriately. In fact, some universities in Vietnam, a non-native English speaking environment, have to buy test formats from a prestigious university in order to have a standardized test. Nonetheless, it is still not entirely fair if the original tests are applied to Vietnamese students without any changes in the test content.
In the context of ULIS, VNU, tests for students are sometimes adapted from Cambridge University tests at various levels, such as PET, FCE and CAE, or from the International English Language Testing System (IELTS) and the Test of English as a Foreign Language (TOEFL). However, what is tested may not be exactly what is taught in the course, because the tests are designed by faraway universities whose students' levels differ from the supposed standards of the university, ULIS, VNU. Besides, some tests are made by teachers themselves; in this case, the tests may not be verified and thus their quality cannot be assured. In addition, "what test writers are concerned with seems to be the reliability of the test and its validity" (Le, 2010). That is to say, reliability and validity are the two most essential qualities of a good language test, and they are the main considerations of test writers when designing a test. However, a test may be reliable but not valid. For example, a reading test with many multiple-choice questions about the vocabulary and grammar used in a passage may be reliable; nonetheless, it is not valid, because it tests not only students' reading skills but also their vocabulary and grammar, violating rules of testing that ensure test validity. Therefore, "the most important quality to consider in the development, interpretation, and use of language tests is validity" (Bachman, 1990). Besides, little research has been done on evaluating the validity of reading tests at ULIS, VNU; there is only some research into the validity of writing and speaking achievement tests, such as those done by To (2001) and Dang (2008). Meanwhile, reading is as important a skill as listening, speaking and writing. Thus, evaluating the validity of a reading test also plays a considerable role in improving the quality of a test, and hence in reinforcing the relationship between testing and teaching and learning.
All of the aforementioned reasons have inspired the researcher to carry out this study to evaluate some aspects of the validity of reading achievement tests for second year mainstream students at FELTE, ULIS, VNU by evaluating an end-of-term reading test in academic English (3B) for second year mainstream students at FELTE, ULIS, VNU at the end of the first semester of the school year 2013-2014. Hopefully, the results of the study can then help to improve the quality of the achievement tests for mainstream sophomores at FELTE, ULIS.

2. Goals and objectives of the study
The study aims at reviewing and analyzing current theories on validity in order to investigate the validity of a reading achievement test, thereby identifying and evaluating the level of validity that the 3B end-of-term reading test has obtained in terms of content validity and structure validity, along with its internal reliability. In addition, the study is also expected to give some suggestions to remedy the flaws, if any, of the current 3B end-of-term reading test for second year mainstream students at FELTE, ULIS, VNU.
3. Research questions

The study is conducted to answer the following questions:

1. To what extent is the 3B end-of-term reading test for second year mainstream students in the school year 2013-2014 at FELTE, ULIS, VNU valid in terms of content validity as perceived by teachers?
2. To what extent is the 3B end-of-term reading test for second year mainstream students in the school year 2013-2014 at FELTE, ULIS, VNU valid in terms of structure validity?
3. What is the internal reliability of the 3B end-of-term reading test for second year mainstream students in the school year 2013-2014 at FELTE, ULIS, VNU?
4. Significance of the study

Once completed, the study should bring about certain benefits. More specifically, the findings would supply test makers and teachers in the targeted context with useful information about whether their inferences about students are accurate and how truly the reading results reflect students' ability. Hence, test makers, teachers and testing experts would get more information about the real situation of testing at FELTE, ULIS, VNU, and could find solutions for any inadequacy in the test's content and structure. In addition, the research would also be a source of reference for further research in the same field.
5. Scope of the study

Firstly, in this paper the researcher focuses on reading achievement tests instead of investigating other kinds of tests such as placement tests or diagnostic tests. Secondly, the exploration of other language skills such as writing, listening and speaking is not included in this study. Furthermore, within the scope of a graduation paper, only two aspects of validity, namely content and structure validity, together with the internal reliability of the test, are studied. Content and structure are said to be two of the most important aspects of the validity of a test; moreover, internal reliability is also a crucial factor in evaluating the quality of a test, since the consistency among test items contributes greatly to the quality of a good test. Thirdly, due to limitations of time and experience, the study is carried out only with a reading achievement test in the English 3B course – English for Academic Purposes – for a group of second year mainstream students at FELTE, ULIS, VNU. The researcher chooses this sample group from among the four school years at the university because the second year is said to be the most significant in the new local curriculum: it is the last year in which students study a common curriculum before they specialize into different majors. In addition, English 3B is more useful and fundamental for academic purposes than English 3A and 3C, which are English for Social Purposes and English for Standardized Testing Purposes respectively. Thus, the researcher decides to study only the 3B reading achievement test of second year mainstream students at FELTE, ULIS, VNU.

The study provides empirical evidence about the current reading achievement tests and proposes practical suggestions for the improvement of the end-of-term reading test for ULIS second year students in general.
6. Methodology of the study

In this study, the test is evaluated by adopting both qualitative and quantitative methods. The research is quantitative in the sense that data are collected through the analysis of the scores of 100 random test papers of students at FELTE, ULIS. To calculate the internal reliability, the researcher uses the Cronbach's Alpha formula via the SPSS 16.0 software. It is qualitative in its use of open-ended survey questionnaires, which were delivered to teachers at Division II, FELTE, ULIS, VNU, and in the comparison of the test specifications with the lesson objectives in the English 3B course guide. The conclusions drawn from this analysis and comparison serve as the qualitative data of the research.
7. The organization of the study

The study is divided into three parts:

Part I: Introduction – presents basic information such as the statement of the research problem, the rationale, the scope, the objectives, the methodology, and the organization of the study.

Part II: Development – consists of three chapters:

Chapter 1: Literature Review – reviews the literature related to language testing and test evaluation.

Chapter 2: Methodology – is concerned with the methods of the study, the selection of participants, the materials, and the methods of data collection and analysis, as well as the results of the data analysis process.

Chapter 3: Findings and Discussion – presents and analyzes the results of the study and reports the findings.

Part III: Conclusion – summarizes the study, its limitations, and recommendations for further studies.
PART II: DEVELOPMENT
CHAPTER 1: LITERATURE REVIEW
This chapter attempts to establish the theoretical background for the study. The key concepts of language testing, including measurement, tests and evaluation, validity and reliability, together with some related studies worldwide and in Vietnam, will be reviewed.
1. Key concepts

1.1. Assessment, measurement, test and evaluation

The terms "assessment", "measurement", "test" and "evaluation" are sometimes used as synonyms, and in practice they can refer to the same activity (Bachman, 1990). For example, when someone is asked for an evaluation of a student, he or she often gives the test score of that student and relies on that score to evaluate the student. However, the terms still have some distinctive features.

1.1.1. Assessment

According to Nitko (1996), assessment is "a broad term defined as a process for obtaining information that is used for making decisions about students, curricula and programs, and educational policy." For example, based on assessment, teachers can make decisions about managing classroom instruction, placing students into different types of educational programs, assigning them to appropriate categories, or judging the effectiveness of programs and the solutions to improve them (Nitko, 1996). Because assessment is a broad term, it can be concluded that assessment refers to all methods used to gather information about a learner's knowledge and skills. In short, "assessment is a broader term than test or measurement, because not all types of assessments yield measurements" (Nitko, 1996).
1.1.2. Measurement

Bachman (1990) states that measurement, in the social sciences, is "the process of quantifying the characteristics of persons according to explicit procedures and rules." Nitko (1996) clarifies that measurement is "a procedure for assigning numbers to a specific attribute or characteristic of a person in such a way that the numbers describe the degree to which the person possesses the attribute." From these points of view, measurement is a method of interpreting numbers to find out an attribute of a subject. Due to these features, measurement is essential in language teaching and learning.

1.1.3. Test

From this definition of measurement, Bachman (1990) states that a test is one type of measurement that is designed to "elicit a specific sample of an individual's behavior". However, he also adds that what distinguishes a test from other types of measurement is that it is used to obtain a particular behavior. Obviously, measurement is a broader concept than a test; a test, as stated above, is just one tool for assessment or measurement.
1.1.4. Evaluation

Nitko (1996) defines evaluation as "the process of making a value judgment about the worth of a student's product or performance." At this point, he emphasizes the relationship between students' behaviors and the judgment made on them. In the same vein, Genesee and Upshur (1996) claim that evaluation is basically about making decisions. This is also the view of Weiss (1972, cited in Bachman, 1990), for whom evaluation is "the systematic gathering of information for the purpose of making decisions." However, evaluation may be separated from tests and measurements: evaluation might be carried out without any test or measurement, because "evaluation may or may not be based on measurements or test results" (Nitko, 1996). As such, evaluation "does not necessarily entail testing" (Bachman, 1990). The relationship between evaluation, tests and measurement is represented in the chart below:

Figure 1.1: Relation between evaluation, tests and measurements (Bachman, 1990)

As can be seen from the figure, evaluation and measurement are two different notions, but they still share some common features. Furthermore, testing is a method of measurement, and hence tests also share some characteristics with evaluation. In a nutshell, evaluation, measurement and tests have a close relationship with one another, and they are essential components of language testing.
1.2. Test purposes

There are various ways to classify test purposes. Wiersma and Jurs (1990) suggest a list of test purposes based on the tasks that a test is expected to perform:

Description: Many tests are developed to describe the current status of individuals on a wide range of variables.

Prediction: Some tests are used for the purpose of predicting examinees' performance in the future.

Assessing individual differences: Some tests are used to differentiate between people in order to identify those who are the highest and those who are the lowest on some measure.

Objectives evaluation: Some tests are used to report the progress of students compared with the objectives of a course or a program, and to plan instruction in terms of those objectives.

Domain estimation: For this purpose, many tests are designed to estimate the percentage of a domain that the student understands.

Mastery decisions: Some tests need to be constructed in such a way that mastery and non-mastery are unambiguously determined, and masters and non-masters are clearly separated by test scores.

Diagnosis: Many tests are designed to diagnose students' strengths and weaknesses through test performance or, more specifically, through the scores on one or more tests.

Pre- and post-assessment: "The purpose of many tests is to document the gains that students have made in school" (Wiersma & Jurs, 1990). That is to say, tests in this case focus on the change in the status or score of test-takers between the pre-test and the post-test.

These purposes are the bases for classifying test types. Based on the purposes of a test, test designers give the test an appropriate name, design the test specifications and administer the test to test takers.
1.3. Types of test items

According to Brown (1996), an item in a language test is "the smallest unit that produces distinctive and meaningful information on a test or rating scale." This definition already shows the use of test items and their importance: an item is the basic unit of a test. There are different types of test items; nonetheless, they may be grouped into two main groups, namely objective and subjective test items.
1.3.1. Objective test items

Objective test items are items that can be marked objectively. They include multiple-choice questions, true-false items and matching items. (Subjective test items, in contrast, consist of gap-filling items, short answer questions and essay items.) Each of these item types has its own features.

Multiple-choice items

The most difficult type of test item to construct is the multiple-choice item. A multiple-choice item, according to Nitko (1996), "consists of one or more introductory sentences followed by a list of two or more suggested responses". In this type of item, students have to choose the right answer from the options listed for the question. A multiple-choice test item includes two parts: the stem and the alternatives. The stem is the part of the item that asks the question (Nitko, 1996). To ensure the quality of the item, the stem should be written carefully so that students understand what task to perform or what question to answer. The alternatives are the listed responses in the item; they go by various names, such as alternatives, choices, responses, and options. Test designers often arrange the alternatives in a meaningful way (logically, numerically, alphabetically, etc.) so as not to give away the answer and to save students' time (Nitko, 1996). Among the alternatives of an item, there are the keyed alternative and the distractors. The keyed answer, or key alternative, or simply the key, is the only correct or the best answer to the question or problem posed. For the purpose of ensuring the validity of the item, the test should be designed to assess students' performance on different formats of tasks. In addition, the difficulty levels of test items need to be adjusted appropriately to students' proficiency. The purpose of the tasks is another consideration before crafting multiple-choice items. Nitko (1996) states that:

"The basic purpose of an assessment task, whether or not it is a multiple-choice item, is to identify those students who have attained a sufficient (or necessary) level of knowledge (skill, ability, or performance) of the learning target being assessed"

If these aforementioned points are assured, the test will be much more valid.

True-false items

True-false items can "cover a wide range of content within a relatively short period" (Nitko, 1996). However, if the items are not constructed well, they can only assess specific, trivial facts, test takers can guess the answers, and the items can be worded ambiguously. Thus, the test can become invalid.
1.3.2. Subjective test items

Subjective items are items that must be marked with subjective judgment by the markers. This type includes short answer and completion items, as well as essay items.

Short answer and completion items

"Short answer items require a student to respond to each item with a word, short phrase, number or symbol" (Nitko, 1996). According to Nitko (1996), there are three varieties of short answer items: question, completion and association. The question variety asks students a direct question, whereas the completion variety expects students to add words to complete an incomplete statement. Meanwhile, the association variety includes "a list of terms or a picture for which students have to recall numbers, labels, symbols, or other items" (Nitko, 1996).

Short answer and completion items are easy to construct, and students have a lower probability of guessing the correct answer. However, it is difficult to score this type of item objectively; that is to say, the rater cannot foresee all the possible responses that students might make. In marking essay items, subjectivity is unavoidable. Although subjective judgment is appropriate there, it slows the scoring process down, and hence the reliability of the obtained scores is also lowered.

Test designers, therefore, should weigh all the advantages and disadvantages of the item types before crafting any of them, to assure the quality of a good test. The items in a test should also be varied, to assure both the reliability and the validity of the test.
1.4. English reading achievement tests

1.4.1. Definition of reading

Carroll (1964) characterizes reading as "the activity of reconstructing the messages that reside in printed text". This conception of reading as the finding of pre-existent meanings is arguably the dominant construct in many reading comprehension tests, especially those that rely heavily on multiple-choice formats (Hill & Parry, 1992; Alderson, 2000). By the same token, Aebersold (1997) suggests that reading is the interaction between the reader and the text. This point of view is clarified in McKay's work (2006). To him, reading involves making meaning from the text; it is often accompanied by writing, and together they are called literacy (McKay, 2006). That is to say, reading is a skill that involves the interaction of readers with the text, and testing reading is mainly in written form (Cameron, 2001, cited in McKay, 2006). Furthermore, reading is not only a process but also a product (McKay, 2006). The process here refers to the process of interaction between the readers and the text, which is also pointed out by Aebersold (1997). According to McKay, the product of reading is reading comprehension. He also argues that there can be two approaches to assessing reading: examining the process of reading, or examining carefully the product of reading and comparing that product with the original reading text.
1.4.2 The construct of reading ability in English reading tests
Alderson (2000) defines a construct as "a psychological concept, which derives from a theory of the ability to be tested. Constructs are the main components of the theory, and the relationship between these components is also specified by the theory." For instance, "some theories of reading state that there are different constructs involved in reading (skimming, scanning, etc.) and that the constructs are different from one another" (Alderson, 1995). In addition, it is necessary to bear in mind that constructs are not psychologically real entities; they are abstractions used for the purposes of assessment (Alderson, 2000).
In terms of the construct of reading ability, Alderson (2000) argues that the basis for reading constructs is a model of reading and of the factors affecting reading. He suggests that test designers should include word recognition ability, since the automaticity with which recognition happens is at the centre of fluent reading (Alderson, 2000). Moreover, meta-cognitive knowledge and monitoring are also regarded as important components of good reading. There are also informal reading assessments: according to Ruscoe (2002), informal reading tests often test readers' phonological awareness, phonics, fluency, vocabulary and comprehension. In the First Certificate in English (FCE) test, meanwhile, the description of the reading test is as follows:
“Candidates are expected to be able to read semi-authentic texts of various kinds (informative and general interest) and to show understanding of gist, detail and text structure and to deduce meaning”
"(Know) how to understand main ideas and how to find specific information; (Do) survey the text; analyze the questions; go back to the text to find answers; check your answers."
(De Witt, 1995)
Carver (1997) recognises five basic reading processes: scanning, skimming, rauding, learning and memorising. Rauding is defined as 'normal' or 'natural' reading, which occurs when adults are reading something that is relatively easy for them to comprehend (Carver, 1997). For Grabe and Stoller (2002), the activity of reading is best captured under seven headings:
Reading to search for simple information
Reading to skim quickly
Reading to learn from texts
Reading to integrate information
Reading to write (or search for information needed for writing)
Reading to critique texts
Reading for general comprehension
One notes that this latter list takes on a slightly simplified form in a recent study conducted for the TOEFL reading test (Enright et al., 2000):
Reading to find information (or search reading)
Reading for basic comprehension
Reading to learn
Reading to integrate information across multiple texts
However, in a study of the IELTS academic reading test conducted in 2009, Weir et al. propose a simpler and more useful taxonomy. Instead of compiling a list of separate skills, the authors construct their taxonomy around two dimensions of difference: reading level and reading type. In terms of reading level, a distinction is made between reading processes focused on text at a more global level and those operating at a more local level. In terms of reading type, the distinction is between what is called 'careful' reading and 'expeditious' reading, the former involving a close and detailed reading of texts, and the latter involving "quick and selective reading to extract important information in line with intended purposes" (Weir et al., 2009). The 'componential matrix' formed by these two dimensions has the advantage of being a more dynamic model, one that is capable of generating a range of reading modes.
In a nutshell, it can be concluded that the constructs of reading measured by different tests vary according to the purposes of those tests. Test takers are tested on reading ability at levels appropriate to their capability. From those constructs, test designers can create appropriate test items or test tasks, so that the quality of the test may be better assured.
1.5 The technical quality of reading tests
1.5.1 Overview of all qualities
There are certain qualities that a good reading test should possess. Bachman and Palmer (1996) contend that test usefulness comprises six main qualities, namely validity, reliability, impact, authenticity, practicality, and interactiveness.
Validity
Bachman (1990) states that, in matters of validity, people are often concerned with the question "How much of an individual's test performance is due to the language ability we want to measure?" and with "maximizing the effects of these abilities on test scores". It follows that validity refers to the extent to which the specific inferences made from test scores are appropriate, meaningful and useful. Furthermore, validity is the most important quality of a good test. As mentioned above, a test can be reliable but not valid; in other words, a test, despite its reliability, can have no validity (Brown, 1996). Bachman (1990) also points out that "reliability is a requirement for validity". That is to say, the quality of a test is not assured if its validity is not achieved. In short, validity is the most important characteristic of a good test.
Reliability
Bachman and Palmer (1996) claim that "reliability is often defined as consistency of measurement" (p. 19). By the same token, reliability, in Genesee and Upshur's (1996) view, refers to consistency, stability, and freedom from nonsystematic fluctuation. This definition covers nearly all the aspects of reliability that other researchers are concerned with. Bachman (1990) suggests that reliability is concerned with answering the question, 'How much of an individual's test performance is due to measurement error, or to factors other than the language ability we want to measure?' For example, a student's score on a test taken for the first time should be equal to his or her score on the same test taken a second time. Reliability and validity are two of the most important qualities of a good test. However, Giap (2008) argues that reliability is a "necessary but not sufficient" quality of a good test. As such, a test can be reliable but not valid. For example, a reading test with multiple-choice questions about grammar and vocabulary may be reliable, but it is not valid because it does not measure reading skills alone.
Impact
Impact, according to Bachman and Palmer (1996), can be defined broadly in terms of the different ways the use of a test affects a society, an educational system, and the individuals within them. Generally, a test operates on a large scale within a society's educational system while, on a small scale, affecting the individual test takers.
Authenticity
Bachman (1991) defines authenticity as the appropriateness of a language user's response to language as communication. However, this definition was not specific enough. Therefore, Bachman and Palmer (1996) divide the idea into two parts. The first relates to target language use (TLU), which they refer to as authenticity; the second they define in relation to the learners involved in the test. The two authors regard authenticity as the degree to which the characteristics of a given language test task correspond to the features of a TLU task. Authenticity also relates a test's tasks to the domain of generalization to which test designers want the interpretations of the scores to be generalized. It potentially affects test takers' perceptions of the test and their performance (Bachman, 2000).
Practicality
"Practicality is the relationship between the resources that will be required in design, development, and use of the test and the resources that will be available for these activities" (Bachman & Palmer, 1996, p. 36). Thus, administration is the primary question of practicality (Genesee & Upshur, 1996). That is to say, the practicality of a good test concerns criteria such as its administration time, the facilities required to administer it, the printing of the papers, the personnel involved, and the handling of marking and scores. Practicality has a close relationship with the reliability and the validity of a test. For instance, if the printing of the test is poor, it can compromise the responses of the test takers and hence affect the validity as well as the reliability of the test. Thus, test designers should avoid problems like these. In conclusion, "tests should be as economical as possible in time (preparation, sitting and marking) and in cost (materials and hidden costs of time spent)" (Heaton, 1990).
Interactiveness
Interactiveness, according to Bachman and Palmer (1996), is "the extent and type of involvement of the test taker's individual characteristics in accomplishing a test task" (p. 25). The interactiveness of a test is affected by a number of factors, often captured by questions such as: Does the test motivate students? Is the language used in the test's questions and instructions appropriate for the students' level? Do the test's items represent the language used in the classroom, as well as the target language?
At the same time, there are also some other qualities of a good test, such as discrimination, the level of difficulty, and the mean (Giap, 2008). Accordingly, discrimination is "the spread of scores produced by a test, or the extent to which a test separates students from one another on a range of scores from high to low" (Giap, 2008). Discrimination is also used to describe "the extent to which an individual multiple-choice item separates the students who do well on the test as a whole from those who do badly" (Giap, 2008). The level of difficulty, as Giap (2008) states, is "the extent to which a test or test item is within the ability range of a particular candidate or group of candidates". The mean is a "descriptive statistic, measuring central tendency. The mean is calculated by dividing the sum of a set of scores by the number of scores".
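These three statistics can be illustrated with a short sketch. The following is not part of the thesis; the score data are hypothetical, and the discrimination index shown is the common upper-group-minus-lower-group variant, computed on dichotomously scored items (1 = correct, 0 = incorrect).

```python
def mean(scores):
    """Mean: the sum of a set of scores divided by the number of scores."""
    return sum(scores) / len(scores)

def item_facility(item_responses):
    """Difficulty (facility): the proportion of test takers answering correctly."""
    return sum(item_responses) / len(item_responses)

def discrimination_index(item_responses, total_scores, fraction=0.5):
    """Discrimination: upper-group facility minus lower-group facility.

    Test takers are ranked by total score; the top and bottom `fraction`
    of the group are compared. Values near +1 mean the item separates
    strong candidates from weak ones well.
    """
    ranked = sorted(zip(total_scores, item_responses), key=lambda p: p[0])
    n = max(1, int(len(ranked) * fraction))
    lower = [resp for _, resp in ranked[:n]]
    upper = [resp for _, resp in ranked[-n:]]
    return item_facility(upper) - item_facility(lower)

# Hypothetical data: total scores of six candidates, and their
# responses to a single item.
totals = [3, 9, 5, 8, 2, 7]
item = [0, 1, 1, 1, 0, 1]
print(mean(totals))
print(item_facility(item))
print(discrimination_index(item, totals))
```

With these hypothetical figures, the item is answered correctly by most candidates but still discriminates, because the two candidates who missed it are the two weakest overall.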
Because of limitations of time and experience, in this study the researcher focuses on only two primary qualities of a good test: validity and reliability.
1.5.2 Validity
1.5.2.1 Definition of validity
Bachman (1990) suggests that validity is a "unitary concept". That is, although evidence of validity can be collected in various ways, it always refers to "the degree to which that evidence supports the inferences that are made from the scores" (Bachman, 1990, p. 237). This point of view has been clarified by Brown (1996), who defines validity as "the degree to which a test measures what it claims, or purports, to be measuring" (Brown, 1996, p. 231). Meanwhile, Messick (1995) contends that validity concerns "the meaning of the test scores". He claims that "indeed, validity is broadly defined as nothing less than an evaluative summary of both the evidence for and the actual, as well as potential, consequences of score interpretation and use" (Messick, 1995). This view integrates content, criterion, and consequence considerations into a construct framework for hypotheses about score meaning and use. In a nutshell, validity is a matter of the meaning of test scores: it indicates whether the test measures what it is supposed to measure.
1.5.2.2 Aspects of validity
Traditionally, validity is divided into three "separate and substitutable types", namely content validity, construct validity, and criterion validity (Hughes, 1989; Bachman, 1990; Nitko, 1996).
Content validity
Content validity refers to the content relevance and content coverage of a test (Bachman, 1990). In terms of content relevance, a test is valid if its domain specification and the specification of the test method facets are valid. A test has content validity built into it by careful selection of the items to include: items are chosen so that they comply with test specifications that are drawn up through a thorough examination of the subject domain. The content coverage of the test, on the other hand, considers "the extent to which the tasks required in the test adequately represent the behavioral domain in question" (Bachman, 1990). Content validity is very important in evaluating a test because "the greater a test's content validity, the more likely it is to be an accurate measure of what it is supposed to measure" (Hughes, 1989, p. 22).
Construct validity
A test has construct validity if it demonstrates an association between the test scores and the prediction of a theoretical trait. Intelligence tests are one example of measurement instruments that should have construct validity. Construct validity is viewed from a purely statistical perspective in much of the recent American literature (Bachman & Palmer, 1996). It is seen in principle as a matter of the posterior statistical validation of whether a test has measured a construct that has a reality independent of other constructs.
To determine whether a piece of research has construct validity, three steps should be followed. First, the theoretical relationships must be specified. Second, the empirical relationships between the measures of the concepts must be examined. Finally, the empirical evidence must be interpreted in terms of how it clarifies the construct validity of the particular measure being tested.
Criterion-related validity
Criterion-related validity is used to demonstrate the accuracy of a measure or procedure by comparing it with another measure or procedure that has been demonstrated to be valid. In other words, the concept is concerned with the extent to which test scores correlate with a suitable external criterion of performance. Criterion-related validity consists of two types: concurrent validity, where the test scores are correlated with another measure of performance, usually an older established test, taken at the same time; and predictive validity, where test scores are correlated with some future criterion of performance (Bachman, 1990).
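The correlation at the heart of criterion-related validity is typically the Pearson product-moment coefficient. The sketch below is illustrative only, not part of the thesis, and the two score lists are hypothetical: one set of scores on a new test, and one on an established criterion test taken at the same time (the concurrent case).

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient of two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical scores: six candidates on the new test and on an
# established criterion test taken concurrently.
new_test = [55, 62, 70, 48, 81, 66]
criterion = [50, 60, 72, 45, 85, 64]
print(pearson(new_test, criterion))
```

A coefficient close to +1 would be read as evidence of concurrent validity; a coefficient near zero would suggest the new test is measuring something other than the criterion.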
However, Messick (1995, p. 741) argues that this conception is "fragmented and incomplete" because it fails to address score meaning and social values in test interpretation and test use. He regards validity as a unified concept integrating content validity, criterion validity, and the consequences of score use into a construct framework, or construct validity. Within this framework, he identifies six main aspects of construct validity: content, substantive, structural, generalizability, external, and consequential. Messick's view considers the validity of tests in a more detailed way, and thus the evaluation of test quality can be enhanced. Therefore, hereafter, the validity discussed in this study is classified according to Messick's (1995) framework, and the researcher bases the evaluation of the validity of the reading test on this framework.
The content aspect of construct validity concerns evidence of the content relevance and representativeness of the test (Messick, 1995). Content relevance refers to the coverage of important parts of the construct domain, while representativeness indicates both the difficulty level of the tasks in the test and the coverage of important parts of the construct domain. Simply speaking, in the language of achievement testing, the content validity of a test considers whether the test measures what the learners have been taught or something unrelated to the learnt content. That is, a valid test has tasks appropriate to the specified boundaries of the construct domain, namely the knowledge, skills, attitudes, motives, and other attributes to be revealed by the assessment tasks.
The substantive aspect of construct validity refers to the theoretical rationale for the observed consistencies in test responses, including process models of task performance, as well as empirical evidence that the theoretical processes are actually engaged by respondents in the assessment tasks (Messick, 1995). Accordingly, this aspect emphasizes two important needs: the need for tasks providing appropriate sampling of domain processes in addition to traditional coverage of domain content, and the need to move beyond traditional professional judgment of content and accumulate empirical evidence that the apparently sampled processes are actually engaged by respondents in task performance. The substantive aspect thus adds to the content aspect the need for empirical evidence of response consistencies or performance regularities reflective of domain processes.
The structural aspect of construct validity, on the other hand, concerns the "fidelity of the scoring structure to the structure of the construct domain at issue" (Loevinger, 1957; Messick, 1989b, cited in Messick, 1995). That is, there should be a reasonable connection between the structure of the construct domain and the scoring scale. For example, in a language test, the score given to each skill tested should correlate rationally with the importance of that skill as stated in the course objectives.
The generalizability aspect of construct validity concerns the extent to which score properties and interpretations generalize to and across population groups, settings, and tasks. In other words, the generalizability aspect is closely related to score meaning. The generalizability of the test depends on "the correlation of the assessed tasks with other tasks representing the construct or aspects of the construct" (Messick, 1995).
The external aspect of construct validity refers to "the extent to which the assessment scores' relationships with other measures and non-assessment behaviors reflect the expected high, low, and interactive relations implicit in the theory of the construct being assessed" (Messick, 1995). That is, the meaning of the scores is evaluated by examining the degree to which empirical relationships with other measures are consistent with that meaning. In other words, the constructs represented in the assessment should reasonably account for the external pattern of correlations.
The consequential aspect of construct validity assesses the evidence and rationales for evaluating the intended and unintended consequences of score interpretation and use, in both the short and the long term.
Because of the researcher's limitations of time and experience, the study focuses on only the content validity and structural validity of the test. In other words, the focus of this research is on the content and structural aspects of construct validity, and these two aspects receive much more attention in this research than the other four aspects mentioned above.
In sum, the content and structural aspects of construct validity play an important role in the quality of a test; an evaluation of an achievement test would not be adequate if these two aspects were ignored.
1.5.2.3 Factors affecting the validity of reading tests
Messick (1995) raises two major factors that threaten the validity of a test, namely construct under-representation and construct-irrelevant variance. The former occurs when the test is "too narrow and fails to include important dimensions or facets of focal constructs" (Messick, 1995, p. 742); that is to say, the test does not adequately measure what it mainly intends to measure. Meanwhile, the latter occurs when "the assessment is too broad, containing excess reliable variance that is irrelevant to the interpreted construct" (Messick, 1995, p. 742).
Basically, construct-irrelevant variance can be divided into two kinds. In testing, these might be called construct-irrelevant difficulty and construct-irrelevant easiness (Messick, 1995). Construct-irrelevant difficulty arises, according to Messick (1995), when some aspects of the focal construct are irrelevantly difficult for some individuals or groups taking the test. This may lead to invalidly low scores for the affected individuals or groups, and to bias in test scoring and interpretation and unfairness in test use (Messick, 1995). In contrast, construct-irrelevant easiness arises when clues in the items allow some individuals or groups to reach the answer without engaging with what the test intends to measure. This type of invalidity also occurs when test material is "either deliberately or inadvertently, highly familiar to some respondents" (Messick, 1995). Unlike construct-irrelevant difficulty, construct-irrelevant easiness leads to "invalidly high scores" for the affected individuals or groups.
In conclusion, test designers should avoid these two threats to validity when designing a test in order to make it a good one.
1.5.3 Reliability
1.5.3.1 Definition of reliability
As stated above, reliability is defined by Bachman and Palmer (1996) as "the consistency of measurements". This quality of a test concerns the test scores of test takers: a reliable test score will be consistent across different characteristics of the testing situation. Hence, reliability can be considered a function of the consistency of scores from one set of test tasks to another.
Nevertheless, it is said that reliability is a necessary but not a sufficient quality of a test. Additionally, the reliability of a test is closely bound up with its validity. While reliability focuses on the empirical aspects of the measurement process, validity focuses on theoretical aspects and seeks to interweave these concepts with the empirical ones. Thus, it is easier to assess reliability than validity.
1.5.3.2 Types of reliability
According to test evaluators, reliability can be estimated by a number of methods, such as "parallel form, split half, rational equivalence, test-retest and inter-rater reliability checks" (Milanovic et al., 1999, p. 168). Following Shohamy (1985), the table below summarizes the types of reliability, their descriptions, and the ways to calculate them:
Table 1.1: Types of reliability

1. Test-retest: the extent to which test scores are stable from one administration to another, assuming no learning occurred between the two occasions. Estimated by the correlation between scores on the same test given on two occasions.

2. Parallel form: the extent to which two tests drawn from the same domain measure the same things. Estimated by the correlation between scores on the two forms of the test.

3. Internal consistency: the extent to which the test questions are related to one another and measure the same trait. Estimated with the Kuder-Richardson Formula 21.

4. Intra-rater: the extent to which the same rater is consistent in his or her rating from one occasion to another. Estimated by the correlation between scores given by the same rater on different occasions, or on one occasion.
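The Kuder-Richardson Formula 21 named in the table can be sketched as follows. This is an illustrative computation, not part of the thesis, and the score data are hypothetical. KR-21 needs only the number of items k, and the mean M and variance V of the candidates' total scores:

    KR-21 = (k / (k - 1)) * (1 - M * (k - M) / (k * V))

```python
def kr21(total_scores, n_items):
    """Kuder-Richardson Formula 21 estimate of internal consistency.

    Assumes dichotomously scored items of roughly equal difficulty;
    uses the population variance of the total scores.
    """
    k = n_items
    n = len(total_scores)
    m = sum(total_scores) / n
    v = sum((s - m) ** 2 for s in total_scores) / n
    return (k / (k - 1)) * (1 - m * (k - m) / (k * v))

# Hypothetical total scores of eight candidates on a 20-item test.
scores = [12, 15, 9, 18, 14, 11, 16, 13]
print(kr21(scores, 20))
```

KR-21 is a convenient approximation because it requires no item-level data; when item difficulties vary widely it tends to understate reliability compared with item-level coefficients such as KR-20.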