Test evaluation is a complex undertaking that has attracted the attention of many researchers since the importance of language tests in assessing students' achievement was first raised. When evaluating a test, evaluators should focus on the criteria of a good test, of which validity and reliability are two important factors. In the current study, the researcher chose the English 3B end-of-term reading test for second year mainstream students at FELTE, ULIS, VNU in the school year 2013-2014 to evaluate, with the aim of checking the content validity and structure validity as well as estimating the internal reliability of the test. From the interpretation of the data obtained from test scores, survey questionnaires, and test specifications analysis, the researcher found that the English 3B end-of-term reading test is reliable in terms of internal reliability. The content validity was also checked, and the test was concluded to demonstrate a relatively high level of content relevance and some evidence of representativeness. Besides, it also proved to show structure validity to some extent. However, the study retains limitations that point to the researcher's directions for future studies.
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES FACULTY OF ENGLISH LANGUAGE TEACHER EDUCATION
GRADUATION PAPER
AN EVALUATION OF SOME ASPECTS OF THE VALIDITY OF A READING ACHIEVEMENT TEST (THE 3B END-OF-TERM READING TEST) FOR SECOND YEAR MAINSTREAM STUDENTS
IN THE SCHOOL YEAR 2013 – 2014 AT FELTE, ULIS, VNU
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
FACULTY OF ENGLISH LANGUAGE TEACHER EDUCATION

GRADUATION PAPER

AN EVALUATION OF SOME ASPECTS OF THE VALIDITY OF AN END-OF-TERM READING TEST (THE ENGLISH 3B END-OF-TERM READING TEST) IN THE SCHOOL YEAR 2013-2014 FOR SECOND YEAR MAINSTREAM STUDENTS AT THE FACULTY OF ENGLISH LANGUAGE TEACHER EDUCATION, UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES, VIETNAM NATIONAL UNIVERSITY, HANOI

Supervisor: Dr. Dương Thu Mai
Student: Vũ Thị Hương
Cohort: QH2010

HANOI – 2014
I hereby state that I, Vũ Thị Hương, QH2010.F.1.E1, being a candidate for the degree of Bachelor of Arts (TEFL), accept the requirements of the College relating to the retention and use of Bachelor's Graduation Papers deposited in the library.

In terms of these conditions, I agree that the origin of my paper deposited in the library should be accessible for the purposes of study and research, in accordance with the normal conditions established by the librarian for the care, loan or reproduction of the paper.

Signature

Vũ Thị Hương
May 2014
I wish to take this opportunity to express my sincere thanks to Ms. Trần Thị Lan Anh, my ELT teacher, for her insightful comments and suggestions for this paper.

I also owe my sincere thanks to the teachers at the Division of English Skills 2, FELTE, ULIS, VNU, who were enthusiastic participants in my research. Without them, my research could not have been completed successfully.

I would like to send my thanks to my teachers, friends and classmates for their sincere comments and criticism as well as their encouragement.

Finally, I would like to express my deep gratitude to my family, especially my mother, who has constantly inspired and encouraged me to overcome difficulties and complete this study.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES AND TABLES
LIST OF ABBREVIATIONS
PART I: INTRODUCTION
1. Statement of research problem and rationale for the study
2. Goals and objectives of the study
3. Research questions
4. Significance of the study
5. Scope of the study
6. Methodology of the study
7. The organization of the study
PART II: DEVELOPMENT
CHAPTER 1: LITERATURE REVIEW
1. Key concepts
1.1. Assessment, measurement, test and evaluation
1.1.1. Assessment
1.1.2. Measurement
1.1.3. Test
1.1.4. Evaluation
1.2. Test purposes
1.3. Types of test items
1.3.1. Objective test items
1.3.2. Subjective test items
1.4. English reading achievement tests
1.4.1. Definition of reading
1.4.2. The construct of reading performance in English and reading achievement tests
1.5. The technical quality of reading tests
1.5.1. Overview of all qualities
1.5.2. Validity
1.5.2.1. Definition of validity
1.5.2.2. Aspects of validity
1.5.2.3. Factors affecting the validity of reading tests
1.5.3. Reliability
1.5.3.1. Definition of reliability
1.5.3.2. Types of reliability
2. Review of related studies on the validity of reading tests
2.1. Studies on the validity of reading tests worldwide
2.2. Studies on the validity of reading tests in Vietnam
CHAPTER 2: METHODOLOGY
1. The reading assessment context for second year mainstream students at FELTE, ULIS, VNU
1.1. Test administration procedure
1.2. Test specifications
2. Research questions
3. Research participants and the selection of participants
4. Data collection methods
4.1. Survey
4.2. Document observation
5. Data collection procedure
5.1. Survey questionnaire
5.2. Document observation
6. Data analysis and procedure
CHAPTER 3: FINDINGS AND DISCUSSION
1. Data analysis and results
1.1. Research question 1: The content validity of the test as perceived by teachers
1.2. Research question 2: The structure validity of the test
1.3. Research question 3: The internal reliability of the test
2. Findings and discussion
2.1. Major findings
2.2. Content validity of the test
2.3. Structure validity of the test
2.4. Reliability of the test
PART III: CONCLUSION
1. Conclusion and implications
2. Limitations of the study
3. Suggestions for further research
REFERENCES
Appendix 1: The 3B end-of-term reading test for second year mainstream students at FELTE, ULIS, VNU in the school year 2013-2014
Appendix 2: Survey questionnaire for teachers
Appendix 3: Internal reliability calculating results
LIST OF FIGURES AND TABLES

FIGURES
Figure 1.1: Relation between evaluation, tests and measurements (Bachman, 1990)
Figure 2.1: Bloom's Taxonomy (old version)
Figure 2.2: Bloom's Taxonomy (revised version)
Figure 2.3: Changes in the old and new versions of Bloom's taxonomy

TABLES
Table 1.1: Types of reliability
Table 1.2: Range of internal reliability values in Cronbach's Alpha
Table 2.1: Test specifications
Table 2.2: Revised test specifications
Table 3.1: Teachers' opinions about the tested skills in questions 71-75
Table 3.2: Teachers' opinions about the tested skills in questions 76-80
Table 3.3: Teachers' opinions about the tested skills in question 81
Table 3.4: Teachers' opinions about the tested skills in question 82
Table 3.5: Teachers' opinions about the tested skills in questions 83-84
Table 3.6: Teachers' opinions about the tested skills in question 85
Table 3.7: Teachers' opinions about the tested skills in questions 86-89
Table 3.8: Teachers' opinions about the tested skills in questions 90-93
Table 3.9: Teachers' opinions about the tested skills in questions 94-99
Table 3.10: Teachers' opinions about the tested skills in questions 100-101
Table 3.11: Teachers' opinions about the tested skills in question 102
Table 3.12: Teachers' opinions about the tested skills in question 103
Table 3.13: Teachers' opinions about the tested skills in questions 104-105
Table 3.14: Teachers' opinions about the tested skills in question 106
Table 3.15: Teachers' opinions about the tested skills in questions 107-110
Table 3.16: Teachers' opinions about the difficulty level of questions 71-75
Table 3.17: Teachers' opinions about the difficulty level of questions 76-80
Table 3.18: Teachers' opinions about the difficulty level of question 81
Table 3.19: Teachers' opinions about the difficulty level of question 82
Table 3.20: Teachers' opinions about the difficulty level of questions 83-84
Table 3.21: Teachers' opinions about the difficulty level of question 85
Table 3.22: Teachers' opinions about the difficulty level of questions 86-89
Table 3.23: Teachers' opinions about the difficulty level of questions 90-93
Table 3.24: Teachers' opinions about the difficulty level of questions 94-99
Table 3.25: Teachers' opinions about the difficulty level of questions 100-101
Table 3.26: Teachers' opinions about the difficulty level of question 102
Table 3.27: Teachers' opinions about the difficulty level of question 103
Table 3.28: Teachers' opinions about the difficulty level of questions 104-105
Table 3.29: Teachers' opinions about the difficulty level of question 106
Table 3.30: Teachers' opinions about the difficulty level of questions 107-110
Table 3.31: The distribution of the skills tested in the test specifications and the course guide
Table 3.32: Internal reliability statistics of the test
LIST OF ABBREVIATIONS
CAE: The Certificate in Advanced English
ELT: English Language Teaching
FCE: The First Certificate in English
FELTE: Faculty of English Language Teacher Education
IELTS: International English Language Testing System
PET: The Preliminary English Test
TOEFL: Test of English as a Foreign Language
ULIS: University of Languages and International Studies
VNU: Vietnam National University, Hanoi
PART I: INTRODUCTION
1. Statement of research problem and rationale for the study
Assessment plays an important role in the process of teaching and learning. It provides teachers with "the information that is used for making decisions about students, curricula and programs, and educational policy" (Nitko, 1996). That kind of information can be collected through a number of tools such as tests, students' diaries, portfolios, etc. Among these, testing is an important tool for educational assessment in general and for language assessment in particular. According to Heaton (1988), testing can be used as a method to strengthen the relationship between teaching and learning, to motivate students, or simply to assess the language performance of students. Testing has a close relationship with teaching; the two are "so closely interrelated that it is virtually impossible to work in either field without being constantly concerned with the other" (Heaton, 1988). Testing is a tool to "pinpoint strengths and weaknesses in the learned abilities of the student" (Henning, 1987, p.1). That is, thanks to testing, teachers can evaluate both their students' abilities in general and language ability in particular, as well as the teaching and learning process. At the same time, learners can self-evaluate their ability through testing. Thus, Read (1997) states that "a test can help both teachers and learners to clarify what the learners really need to know." Clearly, not only teachers but also learners may benefit from testing. That is the reason why testing is implemented in schools at different levels in general and at the University of Languages and International Studies, Vietnam National University, Hanoi (ULIS, VNU) in particular.
In spite of its crucial importance, designing a test is not easy work. Sometimes the content of a test may be suitable for one type of learner at one level but not for other learners at different levels. Thus, learners' abilities may be evaluated inappropriately. In fact, some universities in Vietnam, a non-native English speaking environment, have to buy test formats from a prestigious university in order to have a standardized test. Nonetheless, it is still not entirely fair if the original tests are applied to Vietnamese students without any changes in the test content.
In the context of ULIS, VNU, tests for students are sometimes adapted from Cambridge University tests at various levels, such as PET, FCE and CAE, or from the International English Language Testing System (IELTS) and the Test of English as a Foreign Language (TOEFL). However, what is tested may not be exactly what is taught in the course, because the tests are designed by faraway universities whose students' levels differ from the supposed standards of the university, ULIS, VNU. Besides, some tests are made by teachers themselves; in this case, the tests may not be verified and thus their quality cannot be assured. In addition, "what test writers are concerned with seems to be the reliability of the test and its validity" (Le, 2010). That is to say, reliability and validity are the two most essential qualities of a good language test, and they are the main considerations of test writers when designing a test. However, a test may be reliable but not valid. For example, a reading test with many multiple-choice questions about the vocabulary and grammar used in a passage may be reliable; nonetheless, it is not valid, because it tests not only students' reading skills but also their vocabulary and grammar, violating rules of testing that ensure test validity. Therefore, "the most important quality to consider in the development, interpretation, and use of language tests is validity" (Bachman, 1990). Besides, little research has been done on evaluating the validity of reading tests at ULIS, VNU; there is only some research into the validity of writing and speaking achievement tests, such as those done by To (2001) and Dang (2008). Meanwhile, reading is as important a skill as listening, speaking and writing. Thus, evaluating the validity of a reading test also plays a considerable role in improving the quality of a test, and hence in reinforcing the relationship between testing and teaching and learning.
All of the aforementioned reasons have inspired the researcher to carry out this study to evaluate some aspects of the validity of reading achievement tests for second year mainstream students at FELTE, ULIS, VNU by evaluating an end-of-term reading test in academic English (3B) for second year mainstream students at FELTE, ULIS, VNU at the end of the first semester of the school year 2013-2014. Hopefully, the results of the study can then help to improve the quality of the achievement tests for mainstream sophomores at FELTE, ULIS.

2. Goals and objectives of the study
The study aims at reviewing and analyzing current theories on validity in order to investigate the validity of a reading achievement test, thereby identifying and evaluating the level of validity that the 3B end-of-term reading test has obtained in terms of content validity and structure validity, along with its internal reliability. In addition, the study is also expected to give some suggestions to remedy the flaws, if any, of the current 3B end-of-term reading test for second year mainstream students at FELTE, ULIS, VNU.
3. Research questions

The study is conducted to answer the following questions:

1. To what extent is the 3B end-of-term reading test for second year mainstream students in the school year 2013-2014 at FELTE, ULIS, VNU valid in terms of content validity as perceived by teachers?
2. To what extent is the 3B end-of-term reading test for second year mainstream students in the school year 2013-2014 at FELTE, ULIS, VNU valid in terms of structure validity?
3. What is the internal reliability of the 3B end-of-term reading test for second year mainstream students in the school year 2013-2014 at FELTE, ULIS, VNU?
4. Significance of the study

Once completed, the study should bring about certain benefits. More specifically, the findings would supply test makers and teachers in the targeted context with useful information about whether their inferences about students are accurate and how truly the reading results reflect students' ability. Hence, test makers, teachers and testing experts would get more information about the real situation of testing at FELTE, ULIS, VNU, and could find solutions for any inadequacy in the test's content and structure. In addition, the research would also be a source of reference for further research in the same field.
5. Scope of the study

Firstly, in this paper the researcher focuses on reading achievement tests instead of investigating other kinds of tests such as placement tests or diagnostic tests. Secondly, the exploration of other language skills such as writing, listening and speaking is not included in this study. Furthermore, within the scope of a graduation paper, only two aspects of validity, namely content and structure validity, together with the internal reliability of the test, are studied. Content and structure are said to be two of the most important aspects of the validity of a test; moreover, internal reliability is also a crucial factor in evaluating the quality of a test, since the consistency among test items contributes greatly to the quality of a good test. Thirdly, due to limitations of time and experience, the study is carried out only with a reading achievement test in the English 3B course – English for Academic Purposes – for a group of second year mainstream students at FELTE, ULIS, VNU. The researcher chooses this sample group from among the four school years at the university because the second year is said to be the most significant in the new local curriculum: it is the last year in which students study a common curriculum before they specialize into different majors. In addition, English 3B is more useful and fundamental for academic purposes than English 3A and 3C, which are English for Social Purposes and English for Standardized Testing Purposes respectively. Thus, the researcher decides to study only the 3B reading achievement test of second year mainstream students at FELTE, ULIS, VNU.

The study provides empirical evidence about the current reading achievement tests and proposes practical suggestions for the improvement of the end-of-term reading test for ULIS second year students in general.
6. Methodology of the study

In this study, the test is evaluated by adopting both qualitative and quantitative methods. The research is quantitative in the sense that data are collected through the analysis of the scores of 100 random test papers of students at FELTE, ULIS. To calculate the internal reliability, the researcher uses the Cronbach's Alpha formula via the SPSS 16.0 software. It is qualitative in its use of open-ended survey questionnaires, which were delivered to teachers at Division II, FELTE, ULIS, VNU, and in the comparison of the test specifications with the lesson objectives in the English 3B course guide. The conclusions drawn from this analysis and comparison serve as the qualitative data of the research.
7. The organization of the study

The study is divided into three parts:

Part I: Introduction – presents basic information such as the statement of the research problem, the rationale, the scope, the objectives, the methodology, and the organization of the study.

Part II: Development – consists of three chapters:

Chapter 1: Literature Review – reviews the literature related to language testing and test evaluation.

Chapter 2: Methodology – is concerned with the methods of the study, the selection of participants, the materials, and the methods of data collection and analysis, as well as the results of the data analysis process.

Chapter 3: Findings and Discussion – presents and analyzes the results of the study and reports the findings.

Part III: Conclusion – summarizes the study, its limitations, and recommendations for further studies.
PART II: DEVELOPMENT
CHAPTER 1: LITERATURE REVIEW
This chapter attempts to establish the theoretical background for the study. The key concepts of language testing, including measurement, tests and evaluation, validity and reliability, together with some related studies worldwide and in Vietnam, will be reviewed.
1. Key concepts

1.1. Assessment, measurement, test and evaluation

The terms "assessment", "measurement", "test" and "evaluation" are sometimes used as synonyms, and in practice they can refer to the same activity (Bachman, 1990). For example, when someone is asked for an evaluation of a student, he or she often gives the test score of that student and relies on that score to evaluate the student. However, the terms still have some distinctive features.

1.1.1. Assessment

According to Nitko (1996), assessment is "a broad term defined as a process for obtaining information that is used for making decisions about students, curricula and programs, and educational policy." For example, based on assessment, teachers can make decisions about managing classroom instruction, placing students into different types of educational programs, assigning them to appropriate categories, or judging the effectiveness of programs and the solutions to improve them (Nitko, 1996). Because assessment is a broad term, it can be concluded that assessment refers to all methods used to gather information about a learner's knowledge and skills. In short, "assessment is a broader term than test or measurement, because not all types of assessments yield measurements" (Nitko, 1996).
1.1.2. Measurement

Bachman (1990) states that measurement, in the social sciences, is "the process of quantifying the characteristics of persons according to explicit procedures and rules." Nitko (1996) clarifies that measurement is "a procedure for assigning numbers to a specific attribute or characteristic of a person in such a way that the numbers describe the degree to which the person possesses the attribute." From these points of view, measurement is a method of interpreting numbers to find out an attribute of a subject. Due to these features, measurement is essential in language teaching and learning.

1.1.3. Test

From this definition of measurement, Bachman (1990) states that a test is one type of measurement that is designed to "elicit a specific sample of an individual's behavior". However, he also adds that what distinguishes a test from other types of measurement is that it is used to obtain a particular behavior. Obviously, measurement is a broader concept than a test; a test, as stated above, is just one tool for assessment or measurement.
1.1.4. Evaluation

Nitko (1996) defines evaluation as "the process of making a value judgment about the worth of a student's product or performance." At this point, he emphasizes the relationship between students' behaviors and the judgment made on them. In the same vein, Genesee and Upshur (1996) claim that evaluation is basically about making decisions. This is also the view of Weiss (1972, cited in Bachman, 1990), for whom evaluation is "the systematic gathering of information for the purpose of making decisions." However, evaluation may be separated from tests and measurements: evaluation might be carried out without any test or measurement, because "evaluation may or may not be based on measurements or test results" (Nitko, 1996). As such, evaluation "does not necessarily entail testing" (Bachman, 1990). The relationship between evaluation, tests and measurement is represented in the chart below:

Figure 1.1: Relation between evaluation, tests and measurements (Bachman, 1990)

As can be seen from the figure, evaluation and measurement are two different notions, but they still share some common features. Furthermore, testing is a method of measurement, and hence tests also share some characteristics with evaluation. In a nutshell, evaluation, measurement and tests have a close relationship with one another, and they are essential components of language testing.
1.2. Test purposes

There are various ways to classify test purposes. Wiersma and Jurs (1990) suggest a list of test purposes based on the tasks that a test is expected to perform:

Description: Many tests are developed to describe the current status of individuals on a wide range of variables.

Prediction: Some tests are used for the purpose of predicting examinees' performance in the future.

Assessing individual differences: Some tests are used to differentiate between people in order to identify those who are the highest and those who are the lowest on some measure.

Objectives evaluation: Some tests are used to report the progress of students compared with the objectives of a course or a program, and to plan instruction in terms of those objectives.

Domain estimation: For this purpose, many tests are designed to estimate the percentage of a domain that the student understands.

Mastery decisions: Some tests need to be constructed in such a way that mastery and non-mastery are unambiguously determined, and masters and non-masters are clearly separated by test scores.

Diagnosis: Many tests are designed to diagnose students' strengths and weaknesses through test performance or, more specifically, through the scores on one or more tests.

Pre- and post-assessment: "The purpose of many tests is to document the gains that students have made in school" (Wiersma & Jurs, 1990). That is to say, tests in this case focus on the change in the status or score of test-takers between the pre-test and the post-test.

These purposes are the bases for classifying test types. Based on the purposes of a test, test designers give the test an appropriate name, design the test specifications and administer the test to test takers.
1.3. Types of test items

According to Brown (1996), an item in a language test is "the smallest unit that produces distinctive and meaningful information on a test or rating scale." This definition already shows the use of test items and their importance: an item is the basic unit of a test. There are different types of test items; nonetheless, they may be grouped into two main groups, namely objective and subjective test items.
1.3.1. Objective test items

Objective test items are items that can be marked objectively. They include multiple-choice questions, true-false items and matching items. (Subjective test items, in contrast, consist of gap-filling items, short answer questions and essay items.) Each of these item types has its own features.

Multiple-choice items

The most difficult type of test item to construct is the multiple-choice item. A multiple-choice item, according to Nitko (1996), "consists of one or more introductory sentences followed by a list of two or more suggested responses". In this type of item, students have to choose the right answer from the options listed for the question. A multiple-choice test item includes two parts: the stem and the alternatives. The stem is the part of the item that asks the question (Nitko, 1996). To ensure the quality of the item, the stem should be written carefully so that students understand what task to perform or what question to answer. The alternatives are the listed responses in the item; they go by various names, such as alternatives, choices, responses, and options. Test designers often arrange the alternatives in a meaningful way (logically, numerically, alphabetically, etc.) so as not to give away the answer and to save students' time (Nitko, 1996). Among the alternatives of an item, there are the keyed alternative and the distractors. The keyed answer, or key alternative, or simply the key, is the only correct or the best answer to the question or problem posed. For the purpose of ensuring the validity of the item, the test should be designed to assess students' performance on different formats of tasks. In addition, the difficulty levels of test items need to be adjusted appropriately to students' proficiency. The purpose of the tasks is another consideration before crafting multiple-choice items. Nitko (1996) states that:

"The basic purpose of an assessment task, whether or not it is a multiple-choice item, is to identify those students who have attained a sufficient (or necessary) level of knowledge (skill, ability, or performance) of the learning target being assessed"

If these aforementioned points are assured, the test will be much more valid.

True-false items

True-false items can "cover a wide range of content within a relatively short period" (Nitko, 1996). However, if the items are not constructed well, they can only assess specific, trivial facts, test takers can guess the answers, and the items can be worded ambiguously. Thus, the test can become invalid.
1.3.2. Subjective test items

Subjective items are items that must be marked with subjective judgment by the markers. This type includes short answer and completion items, as well as essay items.

Short answer and completion items

"Short answer items require a student to respond to each item with a word, short phrase, number or symbol" (Nitko, 1996). According to Nitko (1996), there are three varieties of short answer items: question, completion and association. The question variety asks students a direct question, whereas the completion variety expects students to add words to complete an incomplete statement. Meanwhile, the association variety includes "a list of terms or a picture for which students have to recall numbers, labels, symbols, or other items" (Nitko, 1996).

Short answer and completion items are easy to construct, and students have a lower probability of guessing the correct answer. However, it is difficult to score this type of item objectively; that is to say, the rater cannot foresee all the possible responses that students might make. In marking essay items, subjectivity is unavoidable. Although subjective judgment is appropriate there, it slows the scoring process down, and hence the reliability of the obtained scores is also lowered.

Test designers, therefore, should weigh all the advantages and disadvantages of the item types before crafting any of them, to assure the quality of a good test. The items in a test should also be varied, to assure both the reliability and the validity of the test.
1.4. English reading achievement tests

1.4.1. Definition of reading

Carroll (1964) characterizes reading as "the activity of reconstructing the messages that reside in printed text". This conception of reading as the finding of pre-existent meanings is arguably the dominant construct in many reading comprehension tests, especially those that rely heavily on multiple-choice formats (Hill & Parry, 1992; Alderson, 2000). By the same token, Aebersold (1997) suggests that reading is the interaction between the reader and the text. This point of view is clarified in McKay's work (2006). To him, reading involves making meaning from the text; it is often accompanied by writing, and together they are called literacy (McKay, 2006). That is to say, reading is a skill that involves the interaction of readers with the text, and testing reading is mainly in written form (Cameron, 2001, cited in McKay, 2006). Furthermore, reading is not only a process but also a product (McKay, 2006). The process here refers to the process of interaction between the readers and the text, which is also pointed out by Aebersold (1997). According to McKay, the product of reading is reading comprehension. He also argues that there can be two approaches to assessing reading: examining the process of reading, or examining carefully the product of reading and comparing that product with the original reading text.
1.4.2 The construct of reading ability in English reading tests
Alderson (2000) defines a construct as "a psychological concept, which derives from a theory of the ability to be tested. Constructs are the main components of the theory, and the relationship between these components is also specified by the theory." For instance, "some theories of reading state that there are different constructs involved in reading (skimming, scanning, etc.) and that the constructs are different from one another" (Alderson, 1995). In addition, it is necessary to bear in mind that constructs are not psychologically real entities; they are abstractions used for the purposes of assessment (Alderson, 2000).
In terms of the construct of reading ability, Alderson (2000) argues that the basis for reading constructs is a model of reading and of the factors affecting reading. He suggests that test designers should include word recognition ability, since the automaticity with which recognition happens is at the centre of fluent reading (Alderson, 2000). Moreover, meta-cognitive knowledge and monitoring are also regarded as important components of good reading. There are also informal reading assessments: according to Ruscoe (2002), informal reading tests often test readers' phonological awareness, phonics, fluency, vocabulary and comprehension. In the First Certificate in English (FCE) test, meanwhile, the description of the reading test is as follows:
“Candidates are expected to be able to read semi-authentic texts of various kinds (informative and general interest) and to show understanding of gist, detail and text structure and to deduce meaning”
"(Know) how to understand main ideas and how to find specific information; (Do) survey the text; analyze the questions; go back to the text to find answers; check your answers."
(De Witt, 1995)
Carver (1997) recognises five basic reading processes: scanning, skimming, rauding, learning and memorising. Rauding is defined as 'normal' or 'natural' reading, which occurs when adults are reading something that is relatively easy for them to comprehend (Carver, 1997). For Grabe and Stoller (2002), the activity of reading is best captured under seven headings:
Reading to search for simple information
Reading to skim quickly
Reading to learn from texts
Reading to integrate information
Reading to write (or search for information needed for writing)
Reading to critique texts
Reading for general comprehension
One notes that this latter list takes on a slightly simplified form in a recent study conducted for the TOEFL reading test (Enright et al., 2000):
Reading to find information (or search reading)
Reading for basic comprehension
Reading to learn
Reading to integrate information across multiple texts
However, in a study of the IELTS academic reading test conducted in 2009, Weir et al. propose a simpler and more useful taxonomy. Instead of compiling a list of separate skills, the authors construct their taxonomy around two dimensions of difference: reading level and reading type. In terms of reading level, a distinction is made between reading processes focused on text at a more global level and those operating at a more local level. In terms of reading type, the distinction is between what is called 'careful' reading and 'expeditious' reading, the former involving a close and detailed reading of texts, and the latter involving "quick and selective reading to extract important information in line with intended purposes" (Weir et al., 2009). The 'componential matrix' formed by these two dimensions has the advantage of being a more dynamic model, one that is capable of generating a range of reading modes.
In a nutshell, it can be concluded that the constructs of reading measured by different tests vary according to the purposes of those tests. Test takers are tested on reading ability at levels appropriate to their capability. From those constructs, test designers can create appropriate test items or test tasks, so that the quality of the test may be better assured.
1.5 The technical quality of reading tests
1.5.1 Overview of all qualities
There are certain qualities that a good reading test should possess. Bachman and Palmer (1996) contend that test usefulness comprises six main qualities, namely validity, reliability, impact, authenticity, practicality, and interactiveness.
Validity
Bachman (1990) states that, in matters of validity, people are often concerned with the question "How much of an individual's test performance is due to the language ability we want to measure?" and with "maximizing the effects of these abilities on test scores". It follows that validity refers to the extent to which the specific inferences made from test scores are appropriate, meaningful and useful. Furthermore, validity is the most important quality of a good test. As mentioned above, a test can be reliable but not valid; in other words, a test, despite its reliability, can have no validity (Brown, 1996). Bachman (1990) also points out that "reliability is a requirement for validity". That is to say, the quality of a test is not assured if its validity is not achieved. In short, validity is the most important characteristic of a good test.
Reliability
Bachman and Palmer (1996) claim that "reliability is often defined as consistency of measurement" (p. 19). By the same token, reliability, in Genesee and Upshur's (1996) view, refers to consistency, stability, and freedom from nonsystematic fluctuation. This definition covers nearly all the aspects of reliability that other researchers are concerned with. Bachman (1990) suggests that reliability is concerned with answering the question, 'How much of an individual's test performance is due to measurement error, or to factors other than the language ability we want to measure?' For example, a student's score on a test taken for the first time should be equal to his or her score on the same test taken a second time. Reliability and validity are two of the most important qualities of a good test. However, Giap (2008) argues that reliability is a "necessary but not sufficient" quality of a good test. As such, a test can be reliable but not valid. For example, a reading test with multiple-choice questions about grammar and vocabulary may be reliable, but it is not valid because it does not measure reading skills alone.
Impact
Impact, according to Bachman and Palmer (1996), can be defined broadly in terms of the different ways the use of a test affects a society, an educational system, and the individuals within them. Generally, a test operates on a large scale within a society's educational system while, on a small scale, affecting the individual test takers.
Authenticity
Bachman (1991) defines authenticity as the appropriateness of a language user's response to language as communication. However, this definition was not specific enough. Therefore, Bachman and Palmer (1996) divide the idea into two parts. The first relates to target language use (TLU), which they refer to as authenticity; the second they define in relation to the learners involved in the test. The two authors regard authenticity as the degree to which the characteristics of a given language test task correspond to the features of a TLU task. Authenticity also relates a test's tasks to the domain of generalization to which test designers want the interpretations of the scores to be generalized. It potentially affects test takers' perceptions of the test and their performance (Bachman, 2000).
Practicality
"Practicality is the relationship between the resources that will be required in design, development, and use of the test and the resources that will be available for these activities" (Bachman & Palmer, 1996, p. 36). Thus, administration is the primary question of practicality (Genesee & Upshur, 1996). That is to say, the practicality of a good test concerns criteria such as its administration time, the facilities required to administer it, the printing of the papers, the personnel involved, and the handling of marking and scores. Practicality has a close relationship with the reliability and the validity of a test. For instance, if the printing of the test is poor, it can compromise the responses of the test takers and hence affect the validity as well as the reliability of the test. Thus, test designers should avoid problems like these. In conclusion, "tests should be as economical as possible in time (preparation, sitting and marking) and in cost (materials and hidden costs of time spent)" (Heaton, 1990).
Interactiveness
Interactiveness, according to Bachman and Palmer (1996), is "the extent and type of involvement of the test taker's individual characteristics in accomplishing a test task" (p. 25). The interactiveness of a test is affected by a number of factors, often captured by questions such as: Does the test motivate students? Is the language used in the test's questions and instructions appropriate for the students' level? Do the test's items represent the language used in the classroom, as well as the target language?
At the same time, there are also some other qualities of a good test, such as discrimination, the level of difficulty, and the mean (Giap, 2008). Accordingly, discrimination is "the spread of scores produced by a test, or the extent to which a test separates students from one another on a range of scores from high to low" (Giap, 2008). Discrimination is also used to describe "the extent to which an individual multiple-choice item separates the students who do well on the test as a whole from those who do badly" (Giap, 2008). The level of difficulty, as Giap (2008) states, is "the extent to which a test or test item is within the ability range of a particular candidate or group of candidates". The mean is a "descriptive statistic, measuring central tendency. The mean is calculated by dividing the sum of a set of scores by the number of scores".
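These three statistics can be illustrated with a short sketch. The following is not part of the thesis; the score data are hypothetical, and the discrimination index shown is the common upper-group-minus-lower-group variant, computed on dichotomously scored items (1 = correct, 0 = incorrect).

```python
def mean(scores):
    """Mean: the sum of a set of scores divided by the number of scores."""
    return sum(scores) / len(scores)

def item_facility(item_responses):
    """Difficulty (facility): the proportion of test takers answering correctly."""
    return sum(item_responses) / len(item_responses)

def discrimination_index(item_responses, total_scores, fraction=0.5):
    """Discrimination: upper-group facility minus lower-group facility.

    Test takers are ranked by total score; the top and bottom `fraction`
    of the group are compared. Values near +1 mean the item separates
    strong candidates from weak ones well.
    """
    ranked = sorted(zip(total_scores, item_responses), key=lambda p: p[0])
    n = max(1, int(len(ranked) * fraction))
    lower = [resp for _, resp in ranked[:n]]
    upper = [resp for _, resp in ranked[-n:]]
    return item_facility(upper) - item_facility(lower)

# Hypothetical data: total scores of six candidates, and their
# responses to a single item.
totals = [3, 9, 5, 8, 2, 7]
item = [0, 1, 1, 1, 0, 1]
print(mean(totals))
print(item_facility(item))
print(discrimination_index(item, totals))
```

With these hypothetical figures, the item is answered correctly by most candidates but still discriminates, because the two candidates who missed it are the two weakest overall.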
Because of limitations of time and experience, in this study the researcher focuses on only two primary qualities of a good test: validity and reliability.
1.5.2 Validity
1.5.2.1 Definition of validity
Bachman (1990) suggests that validity is a "unitary concept". That is, although evidence of validity can be collected in various ways, it always refers to "the degree to which that evidence supports the inferences that are made from the scores" (Bachman, 1990, p. 237). This point of view has been clarified by Brown (1996), who defines validity as "the degree to which a test measures what it claims, or purports, to be measuring" (Brown, 1996, p. 231). Meanwhile, Messick (1995) contends that validity concerns "the meaning of the test scores". He claims that "indeed, validity is broadly defined as nothing less than an evaluative summary of both the evidence for and the actual, as well as potential, consequences of score interpretation and use" (Messick, 1995). This view integrates content, criterion, and consequence considerations into a construct framework for hypotheses about score meaning and use. In a nutshell, validity is a matter of the meaning of test scores: it indicates whether the test measures what it is supposed to measure.
1.5.2.2 Aspects of validity
Traditionally, validity is divided into three "separate and substitutable types", namely content validity, construct validity, and criterion validity (Hughes, 1989; Bachman, 1990; Nitko, 1996).
Content validity
Content validity refers to the content relevance and content coverage of a test (Bachman, 1990). In terms of content relevance, a test is valid if its domain specification and the specification of the test method facets are valid. A test has content validity built into it by careful selection of the items to include: items are chosen so that they comply with test specifications that are drawn up through a thorough examination of the subject domain. The content coverage of the test, on the other hand, considers "the extent to which the tasks required in the test adequately represent the behavioral domain in question" (Bachman, 1990). Content validity is very important in evaluating a test because "the greater a test's content validity, the more likely it is to be an accurate measure of what it is supposed to measure" (Hughes, 1989, p. 22).
Construct validity
A test has construct validity if it demonstrates an association between the test scores and the prediction of a theoretical trait. Intelligence tests are one example of measurement instruments that should have construct validity. Construct validity is viewed from a purely statistical perspective in much of the recent American literature (Bachman & Palmer, 1996). It is seen in principle as a matter of the posterior statistical validation of whether a test has measured a construct that has a reality independent of other constructs.
To determine whether a piece of research has construct validity, three steps should be followed. First, the theoretical relationships must be specified. Second, the empirical relationships between the measures of the concepts must be examined. Finally, the empirical evidence must be interpreted in terms of how it clarifies the construct validity of the particular measure being tested.
Criterion-related validity
Criterion-related validity is used to demonstrate the accuracy of a measure or procedure by comparing it with another measure or procedure that has been demonstrated to be valid. In other words, the concept is concerned with the extent to which test scores correlate with a suitable external criterion of performance. Criterion-related validity consists of two types: concurrent validity, where the test scores are correlated with another measure of performance, usually an older established test, taken at the same time; and predictive validity, where test scores are correlated with some future criterion of performance (Bachman, 1990).
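The correlation at the heart of criterion-related validity is typically the Pearson product-moment coefficient. The sketch below is illustrative only, not part of the thesis, and the two score lists are hypothetical: one set of scores on a new test, and one on an established criterion test taken at the same time (the concurrent case).

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient of two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical scores: six candidates on the new test and on an
# established criterion test taken concurrently.
new_test = [55, 62, 70, 48, 81, 66]
criterion = [50, 60, 72, 45, 85, 64]
print(pearson(new_test, criterion))
```

A coefficient close to +1 would be read as evidence of concurrent validity; a coefficient near zero would suggest the new test is measuring something other than the criterion.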
However, Messick (1995, p. 741) argues that this conception is "fragmented and incomplete" because it fails to address score meaning and social values in test interpretation and test use. He regards validity as a unified concept integrating content validity, criterion validity, and the consequences of score use into a construct framework, or construct validity. Within this framework, he identifies six main aspects of construct validity: content, substantive, structural, generalizability, external, and consequential. Messick's view considers the validity of tests in a more detailed way, and thus the evaluation of test quality can be enhanced. Therefore, hereafter, the validity discussed in this study is classified according to Messick's (1995) framework, and the researcher bases the evaluation of the validity of the reading test on this framework.
The content aspect of construct validity concerns evidence of the content relevance and representativeness of the test (Messick, 1995). Content relevance refers to the coverage of important parts of the construct domain, while representativeness indicates both the difficulty level of the tasks in the test and the coverage of important parts of the construct domain. Simply speaking, in the language of achievement testing, the content validity of a test considers whether the test measures what the learners have been taught or something unrelated to the learnt content. That is, a valid test has tasks appropriate to the specified boundaries of the construct domain, namely the knowledge, skills, attitudes, motives, and other attributes to be revealed by the assessment tasks.
The substantive aspect of construct validity refers to the theoretical rationale for the observed consistencies in test responses, including process models of task performance, as well as empirical evidence that the theoretical processes are actually engaged by respondents in the assessment tasks (Messick, 1995). Accordingly, this aspect emphasizes two important needs: the need for tasks providing appropriate sampling of domain processes in addition to traditional coverage of domain content, and the need to move beyond traditional professional judgment of content and accumulate empirical evidence that the apparently sampled processes are actually engaged by respondents in task performance. The substantive aspect thus adds to the content aspect the need for empirical evidence of response consistencies or performance regularities reflective of domain processes.
The structural aspect of construct validity, on the other hand, concerns the "fidelity of the scoring structure to the structure of the construct domain at issue" (Loevinger, 1957; Messick, 1989b, cited in Messick, 1995). That is, there should be a reasonable connection between the structure of the construct domain and the scoring scale. For example, in a language test, the score given to each skill tested should correlate rationally with the importance of that skill as stated in the course objectives.
The generalizability aspect of construct validity concerns the extent to which score properties and interpretations generalize to and across population groups, settings, and tasks. In other words, the generalizability aspect is closely related to score meaning. The generalizability of the test depends on "the correlation of the assessed tasks with other tasks representing the construct or aspects of the construct" (Messick, 1995).
The external aspect of construct validity refers to "the extent to which the assessment scores' relationships with other measures and non-assessment behaviors reflect the expected high, low, and interactive relations implicit in the theory of the construct being assessed" (Messick, 1995). That is, the meaning of the scores is evaluated by examining the degree to which empirical relationships with other measures are consistent with that meaning. In other words, the constructs represented in the assessment should reasonably account for the external pattern of correlations.
The consequential aspect of construct validity assesses the evidence and rationales for evaluating the intended and unintended consequences of score interpretation and use, in both the short and the long term.
Because of the researcher's limitations of time and experience, the study focuses on only the content validity and structural validity of the test. In other words, the focus of this research is on the content and structural aspects of construct validity, and these two aspects receive much more attention in this research than the other four aspects mentioned above.
In sum, the content and structural aspects of construct validity play an important role in the quality of a test; an evaluation of an achievement test would not be adequate if these two aspects were ignored.
1.5.2.3 Factors affecting the validity of reading tests
Messick (1995) raises two major factors that threaten the validity of a test, namely construct under-representation and construct-irrelevant variance. The former occurs when the test is "too narrow and fails to include important dimensions or facets of focal constructs" (Messick, 1995, p. 742); that is to say, the test does not adequately measure what it mainly intends to measure. Meanwhile, the latter occurs when "the assessment is too broad, containing excess reliable variance that is irrelevant to the interpreted construct" (Messick, 1995, p. 742).
Basically, construct-irrelevant variance can be divided into two kinds. In testing, these might be called construct-irrelevant difficulty and construct-irrelevant easiness (Messick, 1995). Construct-irrelevant difficulty arises, according to Messick (1995), when some aspects of the focal construct are irrelevantly difficult for some individuals or groups taking the test. This may lead to invalidly low scores for the affected individuals or groups, and to bias in test scoring and interpretation and unfairness in test use (Messick, 1995). In contrast, construct-irrelevant easiness arises when clues in the items allow some individuals or groups to reach the answer without engaging with what the test intends to measure. This type of invalidity also occurs when test material is "either deliberately or inadvertently, highly familiar to some respondents" (Messick, 1995). Unlike construct-irrelevant difficulty, construct-irrelevant easiness leads to "invalidly high scores" for the affected individuals or groups.
In conclusion, test designers should avoid these two threats to validity when designing a test in order to make it a good one.
1.5.3 Reliability
1.5.3.1 Definition of reliability
As stated above, reliability is defined by Bachman and Palmer (1996) as "the consistency of measurements". This quality of a test concerns the test scores of test takers: a reliable test score will be consistent across different characteristics of the testing situation. Hence, reliability can be considered a function of the consistency of scores from one set of test tasks to another.
Nevertheless, it is said that reliability is a necessary but not a sufficient quality of a test. Additionally, the reliability of a test is closely bound up with its validity. While reliability focuses on the empirical aspects of the measurement process, validity focuses on theoretical aspects and seeks to interweave these concepts with the empirical ones. Thus, it is easier to assess reliability than validity.
1.5.3.2 Types of reliability
According to test evaluators, reliability can be estimated by a number of methods, such as "parallel form, split half, rational equivalence, test-retest and inter-rater reliability checks" (Milanovic et al., 1999, p. 168). Following Shohamy (1985), the table below summarizes the types of reliability, their descriptions, and the ways to calculate them:
Table 1.1: Types of reliability

1. Test-retest: the extent to which test scores are stable from one administration to another, assuming no learning occurred between the two occasions. Estimated by the correlation between scores on the same test given on two occasions.

2. Parallel form: the extent to which two tests drawn from the same domain measure the same things. Estimated by the correlation between scores on the two forms of the test.

3. Internal consistency: the extent to which the test questions are related to one another and measure the same trait. Estimated with the Kuder-Richardson Formula 21.

4. Intra-rater: the extent to which the same rater is consistent in his or her rating from one occasion to another. Estimated by the correlation between scores given by the same rater on different occasions, or on one occasion.
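The Kuder-Richardson Formula 21 named in the table can be sketched as follows. This is an illustrative computation, not part of the thesis, and the score data are hypothetical. KR-21 needs only the number of items k, and the mean M and variance V of the candidates' total scores:

    KR-21 = (k / (k - 1)) * (1 - M * (k - M) / (k * V))

```python
def kr21(total_scores, n_items):
    """Kuder-Richardson Formula 21 estimate of internal consistency.

    Assumes dichotomously scored items of roughly equal difficulty;
    uses the population variance of the total scores.
    """
    k = n_items
    n = len(total_scores)
    m = sum(total_scores) / n
    v = sum((s - m) ** 2 for s in total_scores) / n
    return (k / (k - 1)) * (1 - m * (k - m) / (k * v))

# Hypothetical total scores of eight candidates on a 20-item test.
scores = [12, 15, 9, 18, 14, 11, 16, 13]
print(kr21(scores, 20))
```

KR-21 is a convenient approximation because it requires no item-level data; when item difficulties vary widely it tends to understate reliability compared with item-level coefficients such as KR-20.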