VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
FACULTY OF ENGLISH LANGUAGE TEACHER EDUCATION
GRADUATION PAPER
THE QUALITY OF TEST QUESTIONS IN
SAMPLE VIETNAM NATIONAL HIGH SCHOOL ENGLISH EXAMINATION 2017
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
FACULTY OF ENGLISH LANGUAGE TEACHER EDUCATION
GRADUATION PAPER
THE QUALITY OF TEST QUESTIONS IN SAMPLE VIETNAM NATIONAL HIGH SCHOOL ENGLISH EXAMINATION 2017
Supervisor: Dr. Nguyễn Thị Ngọc Quỳnh
Student: Lê Hoàng Kim Khuê
Class: QH2013.F1.E2
HANOI – 2017
Signature of Approval: ______________________
Supervisor's Comments & Suggestions:
______________________
Acceptance
I hereby state that I, Lê Hoàng Kim Khuê, class QH2013.E2, being a candidate for the degree of Bachelor of Arts (TEFL), accept the requirements of the College relating to the retention and use of Bachelor's Graduation Papers deposited in the library.
In terms of these conditions, I agree that the original of my paper deposited in the library should be accessible for the purposes of study and research, in accordance with the normal conditions established by the librarian for the care, loan or reproduction of the paper.
Signature
Lê Hoàng Kim Khuê
Acknowledgements
I would like to commence by expressing my profound gratitude to my supervisor, Dr. Nguyễn Thị Ngọc Quỳnh, who gave me enthusiastic instruction, precious support and critical feedback on the construction of the study. She gave me great opportunities to gain my very first, yet valuable, hands-on experience in the realm of language testing. Her guidance has always been one of the decisive factors in the completion of this thesis.
My sincere thankfulness goes to Dr. Nathan Carr for willingly assisting me in every stage of conducting this paper and encouraging me to delve further into this intriguing field. He has inculcated in me the habit of meticulousness. I am also grateful for his comments and direction.
I would also like to express my gratitude to my thesis examiners, Dr. Dương Thu Mai and MA Ngô Xuân Minh. If it were not for their reading, comments and evaluation of my progress reports, my thesis could not have been completed.
My deep thanks go to all the teachers and seniors at Hanoi – Amsterdam High School who participated in this research, for their cooperation.
Last but not least, my heartfelt thanks go to my family and friends. I am extremely grateful to my mother for her continued encouragement and support, especially in my darkest moments. I would also like to thank Ms Bùi Thiện Sao for her assistance with the theoretical knowledge and data analysis throughout the research.
Abstract
This paper was primarily conducted to evaluate the test reliability and the quality of test questions in the Sample Vietnam National High School English Examination (VNHSEE) 2017. The test is categorized as both a norm-referenced placement test and a criterion-referenced achievement test. Classical Test Theory (CTT) was applied to conduct item analysis and estimate reliability coefficients for the sample test. Pilot testing was carried out with a convenience sample of 200 grade-12 students at Hanoi – Amsterdam High School. Subsequently, descriptive analysis, item analysis, reliability coefficient computation and distractor analysis of the sample test were carried out.
With respect to difficulty and discrimination, VNHSEE 2017 was rather easy for the sample group. As a norm-referenced test serving the purpose of placement, the sample test generally separated the selected examinees rather effectively. However, as an achievement test, VNHSEE 2017 distinguished low-end students better than high-end students.
In addition, the test achieved notably high internal reliability, which illustrates its consistency of scoring and classification. The amendments to the test format may account for the enhancement of its reliability and dependability coefficients.
Overall, Sample VNHSEE 2017 possessed good-quality test questions. However, distractor analysis highlighted some problematic distractors in need of revision. Item design was found to be the main cause of implausible options. Discussions and suggestions for improvement regarding specific cases are presented in detail in the body of the paper.
In conclusion, VNHSEE 2017, with its revised test format, directly influences the English teaching and learning process through its inevitable washback effects. The implications drawn from the current findings hopefully offer suggestions for developing test quality and improving English pedagogical methodologies.
Table of Contents
Acknowledgements i
Abstract ii
Table of Contents iii
List of Tables and Figures vi
List of Acronyms and Abbreviations vii
CHAPTER 1: INTRODUCTION 1
1.1 Overview 1
1.1.1 Vietnam National High School Examination 1
1.1.2 Vietnam National High School English Examination 2017 2
1.2 Problem Statement and Rationale for the Study 3
1.3 Significance of the Study 6
1.4 Objectives of the Study and Research Questions 7
1.5 Organization of the Study 7
CHAPTER 2: LITERATURE REVIEW 9
2.1 Multiple Choice Questions 9
2.1.1 Definition and Use of Multiple Choice Questions 9
2.2 Classical Test Theory 11
2.3 Item Analysis in Classical Test Theory 12
2.3.1 Overview of Item Analysis for Selected Response Tasks 12
2.3.2 Item Analysis and Bachman & Palmer's Qualities of Test Usefulness Framework (1996) 13
2.3.2.1 Bachman & Palmer's Qualities of Test Usefulness Framework (1996) 13
2.3.2.2 Item Analysis and Bachman & Palmer's Qualities of Test Usefulness Framework (1996) 17
2.3.3 Norm – referenced Testing Item Analysis 18
2.3.3.1 Item Difficulty 18
2.3.3.2 Item Discrimination 18
2.3.3.3 Reliability 20
2.3.4 Criterion – referenced Testing Item Analysis 22
2.3.4.1 Item Difficulty 22
2.3.4.2 Item Discrimination 22
2.3.4.3 Dependability 24
2.3.5 Distractor Analysis 25
CHAPTER 3: RESEARCH METHODOLOGY 27
3.1 Research Setting 27
3.2 Participants 27
3.3 Data Collection 28
3.3.1 Data Collection Instrument 28
3.3.2 Data Collection Procedure 28
3.4 Data Analysis 29
3.4.1 Reasons to Choose Classical Test Theory (CTT) 29
3.4.2 Data Analysis Procedure 31
3.4.2.1 Research Question 1: What is the reliability of the VNHSEE 2017? 31
3.4.2.2 Research Question 2: Which distractors in Sample VNHSEE 2017 are in need of revision? 34
CHAPTER 4: RESULTS AND DISCUSSION 35
4.1 Descriptive Statistics 35
4.2 Reliability of Sample VNHSEE 2017 36
4.2.1 Item Parameters 36
4.2.1.1 Norm – referenced Testing (NRT) Item Parameters 36
4.2.1.2 Criterion – referenced Testing (CRT) Item Parameters 39
4.2.2 Reliability of Sample VNHSEE 2017 44
4.2.3 Dependability of Sample VNHSEE 2017 45
4.3 Distractor Analysis 47
CHAPTER 5: CONCLUSION 55
5.1 Summary of Major Findings 55
5.2 Implications 56
5.3 Limitations and Suggestions for Further Research 56
References 58
Appendices 63
List of Tables and Figures
Table 1: Rules of Thumb for Interpreting NRT and CRT Item Analyses 33
Table 2: Descriptive Statistics of Sample VNHSEE 2017 35
Table 3: Descriptive Statistics of NRT Item Parameters of Sample VNHSEE 2017 36
Table 4: Summary Statistics of NRT Item Parameters of Sample VNHSEE 2017 37
Table 5: NRT Item Analysis Results of Item 1 and 2 38
Table 6: NRT Item Analysis Results of Item 28 and 40 39
Table 7: Items with low B – index and Item φ at given cut scores 40
Table 8: CRT Item Analysis Results of Item 1 and 2 41
Table 9: NRT Item Analysis Results of Item 4, 28 and 40 41
Table 10: Descriptive Statistics of B – index of Sample VNHSEE 2017 at four cut scores 43
Table 11: Reliability coefficient and SEM of Sample VNHSEE 2017 44
Table 12: Dependability coefficients of Sample VNHSEE 2017 45
Table 13: Items with unattractive distractors on Sample VNHSEE 2017 48
Table 14: Distractor analysis results of Item 20 49
Table 15: Distractor analysis results of Item 32 50
Table 16: Distractor analysis results of Item 45 51
Table 17: Distractor analysis results of Item 28 52
Table 18: Distractor analysis results of Item 25 53
Table 19: Distractor analysis results of Item 2, 9, 23 and 40 53
Figure 1: Histogram of Sample VNHSEE 2017 35
Figure 2: Frequency Polygon of B – index distributions 42
List of Acronyms and Abbreviations
α Cronbach’s alpha
λ lambda; cut score
φ item phi; phi correlation coefficient
Φ index of dependability
Φ(λ) Phi(lambda)
B, B – index Brennan’s B – index
CRT criterion-referenced test, criterion-referenced testing
CTT classical test theory
IF item facility/ difficulty
MCQ multiple choice question
Mdn median
Mo mode
MOET Ministry of Education and Training
NRT norm-referenced test, norm-referenced testing
pb(r) point-biserial item discrimination index
SD standard deviation
SEM standard error of measurement
VNHSE Vietnam National High School Examination
VNHSEE Vietnam National High School English Examination
CHAPTER 1: INTRODUCTION
1.1 Overview
Testing and assessment constitute an indispensable process that drives teaching and learning – through content knowledge, test format and subsequent feedback (Cooke et al., 2006). Among the many assessment tools, Multiple Choice Questions (MCQs) are widely applied in formative and summative assessment in virtually all academic fields, especially in high-stakes tests. In relation to test reliability, the quality of test items has come under scrutiny for its role in test revision and development. This paper is conducted with a view to evaluating the reliability of the Sample Vietnam National High School English Examination 2017 (VNHSEE 2017) by investigating the quality of its test questions.
1.1.1 Vietnam National High School Examination
Since tertiary education is deemed to be an identifier of 'propensity for student success' (Palmer, Bexley & James, 2011), university and college admission plays an important role as a gatekeeper. There are sharp distinctions in selection methods among countries and jurisdictions, contingent on their own educational contexts (UNESCO, 2015). Whilst universities and colleges in the US, the UK, Australia and elsewhere require application packages of standardized test scores, official transcripts, personal statements, reference letters and resumes (Andrich & Mercer, 1997), those in Vietnam rely solely on a high school diploma and academic performance in the Vietnam National High School Examination (VNHSE).
VNHSE has been organized annually around mid-June by the Ministry of Education and Training (MOET) since 2015. Before VNHSE, Vietnamese students used to take two discrete nationwide exams, the High School Graduation Examination and the University Entrance Examination. The results of VNHSE are therefore utilized for both high school graduation and university admission decisions (Decision No 3538/QĐ-BGDĐT, MOET, 2014). Candidates are required to take, in total, six tests: Math, Literature, a foreign language (English, French, Russian, Japanese, German or Chinese) and one combination of subjects for their intended major (Social subjects: History, Geography and Civic Education; Science subjects: Physics, Chemistry and Biology). Cut scores are separately predetermined by universities and colleges to distinguish between passing and failing test takers. These cut-off lines are set higher than the minimum required by MOET (Tran, Griffin & Nguyen, 2010) and vary among institutions and majors. Eventually, universities and colleges rank prospective students' VNHSE results in descending order and utilize cut scores as a tool to select students.
1.1.2 Vietnam National High School English Examination 2017
In October 2016, MOET published the Sample Vietnam National High School English Examination (VNHSEE) 2017 with some changes in test format compared to the official versions administered in 2015 and 2016. In general, the total number of questions is condensed from 70 to 50 and the time limit is reduced from 90 to 60 minutes. The official versions of the 2015 and 2016 examinations utilized both selected and constructed response tasks (64 multiple choice questions, 5 rewriting items and 1 paragraph writing task). Sample VNHSEE 2017 eliminates the constructed response tasks for evaluating writing ability and solely utilizes multiple choice items to assess four language areas: phonetics, grammar, vocabulary and reading.
According to Vietnamnet (2017), the number of applications for VNHSE 2017 has increased to 859,835. This massive figure illustrates the test's nationwide influence on teachers and learners. Given the amendments to testing policies by MOET, VNHSEE 2017 provides information used both to validate students' graduation results and to determine their university admission status. Accordingly, it directly influences educators and students on a larger scale than it did before the reform. Furthermore, many universities and colleges, in fact, double the English test result before adding in other test scores to compute test takers' total scores. This means that the English test can outweigh the other subject tests.
In this paper, Sample VNHSEE 2017 is classified as both a norm-referenced and a criterion-referenced test, owing to its special score-interpreting and decision-making process. For the purpose of a placement test, serving as the University Entrance Examination together with two other subject tests, test takers' scores on VNHSEE 2017 are computed and utilized for ranking students in descending order. By comparing an examinee's result to others' performance on the test and selecting the top students, universities and colleges make relative decisions for their admission process. On the other hand, with the aim of validating students' high school graduation results, MOET predetermines a cut score of 1 (out of 10) on VNHSEE 2017 to separate passing and failing test takers. This means absolute decisions are made on how well candidates performed on the test in relation to the cut score, not in relation to others' results.
In short, VNHSEE 2017 is basically a matriculation test, categorized as both an NRT placement test and a CRT achievement test.
1.2 Problem Statement and Rationale for the Study
Although it received attention later than Language Teaching Methodology, Language Testing and Assessment has been presenting innovations and experimentation in furtherance of progressive theoretical advancement over the last century. The sphere has witnessed significant maturation since the advent of seminal works by Spearman, Lado and Harris in the 1950s, followed by the test theories of Novick in 1966 and of Lord, Rasch and Lazarsfeld in the 1960s. These pioneering contributions are the foundation for the development of assessment types on the 'academic' and 'scientific' footing of today. Assessment tools are thus deemed vital and influential to the interface between learning and testing (McInerney, 2013) as they assist in making judgments on examinees' learning process and, in return, on task design (Mislevy, 2007).
The significance of classifying and applying task formats has long been emphasized because choosing building materials always comes before drawing up the blueprint of a whole test (Carr, 2011). Among all the task formats, MCQs are the 'best recognized' (Carr, 2011) and most widely used in high-stakes testing (Haladyna & Downing, 1989). The perception of many test item writers that crafting MCQs does not require much time and effort is partially correct; in fact, they do not need to devote much to designing 'poor' items (Carr, 2011). According to Cangelosi (1990) and Haladyna, Downing & Rodriguez (2002), there are principles and guidelines that item writers should follow to maximize the quality of multiple choice items. Once the questions have been written and reviewed, pilot testing is recommended to evaluate whether the test is functioning in the desired manner (Carr, 2011). More importantly, with the statistics from the pilot test, administrators can investigate the quality of test questions to revise and develop items for the 'official' versions.
Irrespective of its significance, item analysis is frequently carried out only for international high-stakes tests such as the Test of English as a Foreign Language (TOEFL), the Test of English for International Communication (TOEIC), the International English Language Testing System (IELTS), etc. For example, Kostin (2004) reveals correlations between lexical gaps and item difficulty in the TOEFL dialogue listening part, and similar conclusions were reached for TOEFL mini-talks (Freedle & Kostin, 1999) and TOEFL reading (Freedle & Kostin, 1993). IELTS calibrates items using 'anchor items' drawn from item analysis (Anderson, Caroline & Wall, 1995) to maintain stable test reliability. Cambridge English Language Assessment (2015) considers item analysis an element of its quality assurance process. In addition, based on item analysis statistics, Cambridge English Language Assessment revises 'problematic' items in Cambridge English: Key and utilizes item analysis as a tool in the first and second iterations to 'appraise' test items. Last but not least, the Seoul National University Criterion-Referenced English Proficiency Test (SNUCREPT) is validated by its mean discriminability indices (Choi, 1994). Comparable to TOEFL and TOEIC, it has an index over .35, which illustrates its adequate power of discrimination. One more important result of that paper is the distractor analysis – one part of item analysis – which reveals the relationship between the topics of reading passages and the difficulty index, 'the dilemma in constructing validity' (Choi, 1994).
In Vietnam, equivalent research has not been widely conducted to assess test reliability, compared to the aforementioned worldwide high-stakes tests. VNHSE – one of the topmost high-stakes tests in Vietnam – has only been administered for two years; moreover, annual changes in the test format of the English paper raise questions about test quality. Despite considerable efforts to validate VNHSEE, both before and after the reform, a limited number of item analyses and studies on test quality have been available for public scrutiny, apart from statistics on test-takers, testing centers and measures of central tendency (MOET, 2015).
Some independent research has been carried out, such as Tran, Griffin & Nguyen's (2010) study on a framework and methodology to validate the interpretation and use of the Vietnam University Entrance English Examination (VUEEE) 2008 for university selection, and Ngo's (2010) study of an achievement test at a university. Tran, Griffin & Nguyen (2010) investigated an approach to validating the interpretation and use of VUEEE 2008 scores for university selection, employing Messick's (1989) validation framework with evidence of six aspects of validity (namely content, substantive, structural, generalizability, external and consequential aspects). VUEEE 2008 was reported to be a sound measure of English language ability, which correctly assessed students for undergraduate admission. However, the data for that paper is out of date in terms of test format. Therefore, concerning item analysis for test development and item revision, research on VNHSE has not yet been conducted and published widely.
Another study, by Ngo (2010), evaluated the reliability and validity of an English achievement test for third-year students at the University of Technology, Ho Chi Minh National University. The given achievement test had problems with its item difficulty indices: roughly 70% of the test items were found to be too difficult or too easy for the test-takers. Besides, the test also possessed poor discriminability (Ngo, 2010). Such item parameters indicate key flaws in a test (an overly attractive distractor, or an extremely difficult or easy item). Overall, Ngo (2010) notes the issues in need of improvement and proposes suggestions for future tests. However, that paper was limited to a classroom setting and not applied to high-stakes tests.
In a nutshell, item analysis plays an indispensable role in examining test quality, yet it has been neglected for Vietnamese high-stakes tests in general and VNHSE in particular. Most recently, MOET released Sample VNHSEE 2017 with a revised test format, which immediately grasped public attention, from test takers to other stakeholders such as university and college boards, teachers, parents and even policy makers. Hence, this thesis on Sample VNHSEE 2017 is expected to delineate an overall picture of the quality of its test questions and its test reliability, as well as to propose some suggestions to MOET in case revisions are needed.
In short, this research is designed as an investigation into the quality of test questions in the Sample Vietnam National High School English Examination 2017.
1.3 Significance of the Study
This investigation elucidates the test reliability and item analysis of Sample VNHSEE 2017, applying Classical Test Theory (CTT). Its outcomes would be beneficial to test-takers, educators and other stakeholders (institutions, parents and other researchers interested in this field, etc.).
First and foremost, this study is carried out with the hope of providing examinees and teachers with a general description of the revised test format. Accordingly, rational adjustments in pedagogical methodology by both learners and teachers would be advisable for approaching the new test format.
Moreover, MOET and university and college admission committees are also informed about the VNHSEE test reliability so that they can review the admission procedure. Suggestions for changes and implications could hopefully assist in making commensurate adjustments for higher quality in the upcoming tests.
Last but not least, at the literature level, the expected outcomes of this paper would carry multiple implications regarding practices of testing and assessment in Vietnam, where educational research will need significantly more focus in the future. In addition, the generated indices (reliability, difficulty, discrimination and distractor attractiveness, etc.) can be a reliable source of reference for further research in this field.
1.4 Objectives of the Study and Research Questions
The primary purpose of this investigation is to identify the quality of reliability of Sample VNHSEE 2017. Secondly, the quality of test items and the efficacy of distractors are also examined. To be more specific, this thesis aims at answering the two following questions:
Research Question 1: What is the reliability of Sample VNHSEE 2017?
Sample VNHSEE 2017 is classified as both a norm-referenced and a criterion-referenced test; accordingly, item analysis with Classical Test Theory was developed on a basis suited to both test types. Research Question 1 was subdivided into three questions to separately yield the targeted answers.
Research Question 1.1: What are the item parameters of the sample test?
Research Question 1.2: What are the NRT reliability statistics of the sample test?
Research Question 1.3: What are the CRT dependability statistics of the sample test?
Research Question 2: What distractors in Sample VNHSEE 2017 are in need of revision?
1.5 Organization of the Study
This thesis is divided into five chapters. Chapter 1 introduces the research topic and the rationale underlying it. Chapter 2 covers theoretical background related to MCQs, CTT and item analysis, as well as previous relevant studies and item analyses of MCQs. Chapter 3 clarifies the methods employed, with details of the selection of participants, research instruments and the procedures of data collection and data analysis. Chapter 4 reports the findings and discussion. Chapter 5 concludes the investigation by summarizing the empirical findings, stating implications, identifying limitations of this study and offering suggestions for further studies.
CHAPTER 2: LITERATURE REVIEW
2.1 Multiple Choice Questions
2.1.1 Definition and Use of Multiple Choice Questions
A multiple choice question (MCQ) is a type of selected response task which requires test-takers to decide on the correct answer from a set of several alternatives. It consists of two elements: a stem and options. The stem is the part where the requirement is posed, and it can take any format, such as a question or a problem statement. Options, also known as alternatives, responses or choices, are the possible answers that examinees have to decide among. Among the suggested options, there is one correct answer, called the key, and other incorrect answers, called distractors or foils (Nitko, 2001). The purpose of distractors is to present plausible information that lures test takers with inadequate understanding or knowledge away from the correct answer.
According to Haladyna & Downing (1989) and Carr (2011), compared to other selected response task formats (True–False, Matching and Ordering questions), MCQs achieve topmost popularity, especially in high-stakes and standardized tests, owing to the five following merits.
First and foremost, MCQs are versatile as they can be utilized for various learning objectives and levels. In contrast with criticisms of their triviality (Hoffman, 1967) and their inability to measure test takers' divergent thinking (Dudley, 1973), multiple-choice formats can assess 'taxonomically higher-order' cognitive processes of comprehension, analysis, synthesis and application, in addition to recall of isolated knowledge (Nasir, 2014).
Besides, thanks to the provision of a fixed set of alternatives for examinees to decide among, the multiple choice format mitigates test takers' possibilities of 'bluffing' or 'dressing up' their responses (Wood, 1977).
Thirdly, compared to other assessment formats, effective MCQs support test validity since they can test and measure a considerable amount of knowledge in a shorter time (Fowell & Bligh, 1998). As a result, the overall achievement of test takers can be indicated more appropriately and meaningfully from score interpretations.
Moreover, the distractors chosen by students serve as small diagnostic tests for teachers to determine their students' strengths and weaknesses. However, diagnosing students in this way alone does not guarantee a completely trustworthy result, so teachers should only use it as a foundation to further and confirm their diagnosis.
In addition, the test reliability indices of MCQs are relatively high because of their format and their marking process (Burton et al., 1991). To be more specific, multiple choice formats, consisting of stems and alternatives, are less affected by guessing than True–False items and by partial answers than Short Answer questions. Besides, a computerized marking process mitigates subjectivity and scorer inconsistencies. Its rapidity and efficacy thus merit the preference for MCQs in testing and assessment (Burton et al., 1991).
On the other hand, Wood (1977) summarizes four major criticisms of multiple choice items, although the following reproofs can be applied to any assessment format. It is also unfair to blame multiple choice tasks for all the flaws below, because some may arise from the way teachers administer tests or design marking rubrics. First of all, MCQs limit students' opportunity to expand their responses to the tested knowledge because they are obliged to decide from a fixed set of alternatives. Secondly, poorly designed MCQs cannot assess factual and valuable knowledge and become tedious if continuously adopted. This triggers the next criticism of multiple choice items, which lies in their format. Specifically, due to the fixed format of one key and more than one distractor, advanced students can use test-taking tips to detect distractors through ambiguous wording, unrelated information or divergent statements, whilst others cannot. Finally, the knowledge tested using MCQs can be limited and inauthentic. The issues students are obliged to cope with are often presented with only one remedy (there is one correct choice). Consequently, students can develop misconceptions about real-world contexts. Moreover, the abuse of this test format in high-stakes testing may orientate education in a negative way and exert washback effects on pedagogical methodologies. For instance, teachers tend to adopt drill-and-practice teaching methods to prepare students for the test; however, this is less effective when it comes to assessing profound knowledge and high-level skills.
2.2 Classical Test Theory
From 1904 to 1913, Charles Spearman published his studies of fallible measures and true objective values. His effort to correlate these two terms paved the way for Classical Test Theory (CTT). Many testing experts, such as Guilford (1936), Gulliksen (1950), Magnusson (1967), and Lord and Novick (1968), have restated and developed this theory into the model that is widely used nowadays.
CTT concentrates on two elements of any observed score, a true score and a random error component, as expressed in the formula X = T + E, where X represents the observed score, T is the true score of an individual and E is a random error component. The difference between X and T can be caused by mismarking, guessing, question misreading, etc. Irrespective of whether E exerts a positive or negative influence on a test taker's result, it is inaccurate to appraise their true score based solely on the number of correct answers they obtain on the test.
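To make the X = T + E model concrete, the following Python sketch (illustrative only; the true scores, error spread and sample size are hypothetical and not taken from this study) simulates how observed scores scatter around true scores because of random error.

```python
# Illustrative simulation of the CTT model X = T + E with hypothetical values.
import numpy as np

rng = np.random.default_rng(42)
true_scores = rng.normal(loc=30, scale=5, size=1000)   # hypothetical true scores T
error = rng.normal(loc=0, scale=2, size=1000)          # random error component E
observed = true_scores + error                         # observed scores X = T + E

# The error averages out across examinees but inflates the spread of observed
# scores, so any single observed score only approximates that examinee's true score.
print(round(error.mean(), 2))                                   # close to 0
print(round(true_scores.std(), 2), round(observed.std(), 2))    # observed spread is larger
```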
Standard Error of Measurement (SEM), which quantifies the spread of measurement error around the true score, is another essential concept of CTT. It focuses on measuring error and obtaining the relevant reliability index in order to understand and improve the reliability of tests, specifically of optional questions (McAlpine, 2002). SEM is calculated with the formula:
SEM = s_x √(1 − r)
where s_x represents the standard deviation of total test scores and r is reliability (Cronbach's α is used in this paper). Although there is no precise estimate of the error in each individual score, SEM provides testing experts with a confidence interval to indicate the expected variation around each examinee's observed score. They can be 68% confident that a student's true score lies within one SEM of his or her observed score (+/- 1 SEM) and 95% sure that his or her true score is within 2 SEMs of the observed score (+/- 2 SEMs) (Carr, 2011). They can also be over 99% confident that a test taker's true and observed scores will be no more than three SEMs apart.
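As a rough illustration, the SEM formula and the confidence bands described above can be computed as in the short Python sketch below; the standard deviation and reliability values are assumed for demonstration and do not come from the sample test.

```python
import math

def standard_error_of_measurement(sd_total: float, reliability: float) -> float:
    """SEM = s_x * sqrt(1 - r), with r a reliability estimate such as Cronbach's alpha."""
    return sd_total * math.sqrt(1.0 - reliability)

def confidence_band(observed_score: float, sem: float, n_sems: int = 1) -> tuple:
    """Interval observed_score +/- n_sems * SEM (68% for 1 SEM, 95% for 2 SEMs)."""
    return (observed_score - n_sems * sem, observed_score + n_sems * sem)

# Assumed values: standard deviation of total scores = 6.0, reliability = .88
sem = standard_error_of_measurement(6.0, 0.88)
print(round(sem, 2))                 # about 2.08 score points
print(confidence_band(35, sem, 2))   # 95% band around an observed score of 35
```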
CTT examines test statistics at both the item and the test level. At the item level, item facility and item discrimination are two important concepts. Item facility/difficulty, abbreviated IF, illustrates the proportion of test takers who answered the item correctly. The item discrimination index indicates the efficiency of each item in separating students with high and low performance. At the test level, CTT focuses on the reliability coefficient of total test scores on parallel test forms (Crocker & Algina, 2008).
2.3 Item Analysis in Classical Test Theory
2.3.1 Overview of Item Analysis for Selected Response Tasks
Item analysis assists test developers in evaluating item performance and identifying problematic questions. Each item is examined in three respects: how difficult it is, how effectively it distinguishes between high- and low-scoring test takers, and how well each distractor performs. These are called item difficulty, item discrimination and distractor analysis, respectively. All three are further discussed in 2.3.3 and 2.3.4; below are general descriptions of them.
In classical test analysis, item difficulty/facility is abbreviated as IF or p-value. This index is utilized to measure and evaluate the difficulty level of items in norm-referenced tests (NRTs). For criterion-referenced tests (CRTs), the difficulty value is not applied analogously, and thus items should not be reported as problematic solely because of an unsatisfactory difficulty value.
Item discrimination for NRTs and CRTs can both be computed with correlational and subtractive approaches; however, it is expressed as different sets of indices. Each item in an NRT has two discrimination values: the point biserial for the correlational method and the upper–lower discrimination for the subtractive approach. CRT discrimination values are likewise computed as either item φ for the correlational approach or the B-index for the subtractive approach. However, item discrimination for CRTs is cut-score dependent. Cut scores are the cut-off lines that separate test takers into two or more groups (passing or failing), levels (basic, intermediate or advanced) or classes (A, B or C). Accordingly, if a given test has several cut scores, each item will also possess several discrimination values.
Distractor analysis for selected response tasks concerns distractor attractiveness (option response frequency) and the option point biserial, which are used to evaluate distractor performance on each item.
2.3.2 Item Analysis and Bachman & Palmer’s Qualities of Test Usefulness
Framework (1996)
2.3.2.1 Bachman & Palmer’s Qualities of Test Usefulness Framework (1996)
In stark contrast with early theory, which pressed for the maximization of every discrete test characteristic to achieve overall usefulness, Bachman and Palmer (1996) advocate an appropriate balance among test characteristics. This balance can vary from one testing practice to another. Furthermore, they develop this notion in their Qualities of Test Usefulness Framework (1996), specifying six test qualities and providing three principles for language test design and development.
Overall, test usefulness is the combination of six complementary qualities: reliability, construct validity, authenticity, interactiveness, impact and practicality. According to Bachman & Palmer (1996):
• Reliability refers to consistency of measurement, regardless of different testing settings, various sets of test task characteristics and diverse rating styles among raters.
• Construct validity is the extent to which test scores can be interpreted and generalized as an indicator for language testers to measure targeted language ability
• Authenticity pertains to the degree to which test tasks are homologous with those in particular TLU domains or other nontest language use domains
• Interactiveness refers to the relation between test takers and their language ability, topical knowledge and affective schemata.
• Impact is taken into consideration in test design and development as tests can exert substantial influences on or consequences for score interpretations and decisions
• Distinctive in nature from five aforementioned test qualities, practicality refers to the ways of administering and developing tests as well as resources required for implementing them
Three principles form the foundation for analyzing overall test usefulness in the development of language tests. Principle 1 reminds language testers that the goal to maximize is the overall test usefulness, not the discrete qualities as in early theory. Although these characteristics differ from one another, they operate in an interrelated way and must be assessed 'in terms of their combined effect on the overall usefulness of the test' (Bachman & Palmer, 1996:18). According to Principle 3, there is no standard prescription for the appropriate balance among test qualities because it must be diagnosed for each particular testing situation. In summary, in order to design and develop a useful test, language testers must have a clear purpose, aim at a specific group of examinees and specify the target language use (TLU) domain beforehand (Bachman & Palmer, 1996).
In examining the overall usefulness of a language test, it is crucial to study each particular quality systematically. However, this paper is confined to the quality of reliability only, since it correlates with item analysis in CTT (see 2.3.2.2 for further explanation). Therefore, only test reliability is discussed further in this research.
2.3.2.1.1 Test Reliability
Reliability is the consistency of measurement, indicating the extent to which test scores are consistently achieved across different testing practices. For example, if a given test is delivered to the same group of test takers on two distinct occasions, in two distinct settings, this should not make any difference to a particular examinee's result. Another instance of a reliable test occurs when language testers develop and administer two discrete, yet interchangeable, tests and obtain the same outcome from a specific test taker. When two interchangeable tests are designed to rank individuals or separate them into high and low groups, the scores obtained from either form must arrange test takers in exactly the same order; otherwise, the tests are deemed unreliable indicators of the targeted language ability. Last but not least, inconsistency among raters also affects the reliability of test scores, since a test composition should obtain the same score from different raters. In fact, some raters are severe and others lenient in marking the same paper, which results in inconsistent measurement; consequently, unreliable test scores are a foregone conclusion.
Test reliability is often mathematically computed and numerically reported. Cronbach's Alpha (α) and Cohen's Kappa (κ) are the most widely utilized coefficients: the former indicates internal consistency reliability, while the latter measures inter-rater reliability. Standardized and high-stakes tests generally achieve notably high reliability coefficients of beyond .80, demonstrating that they are highly reliable. For instance, the reliability of IELTS was reported at .90 and .91 for reading and listening in 2010 (Taylor & Weir, 2012). TOEFL iBT also obtained a desirable overall reliability level of .94 in 2007 (ETS, 2008). The overall reliability of the Gaokao English tests was estimated at .94 in both 2010 and 2011 (Wang, Huang & Schenell, 2013). In the case of high-stakes tests in Vietnam, the old format of VNHSEE, called the Vietnam University Entrance English Examination (VUEEE), was investigated in its 2008 form and found to be reliable by Tran (2014). However, changes in test format since 2015 have made these statistics rather outdated. Recently, Bui (2016) compared the internal reliability of the old and revised test formats, VUEEE 2014 and VNHSEE 2015. The results reveal that the amended test format in 2015 achieved a reliability estimate of .88, which is higher than that of the traditional format (.86) in 2014. On a smaller scale, there are several investigations into the reliability of achievement tests in universities. For example, the reliability of the reading achievement test for sophomores at the Faculty of English Language Teacher Education, University of Languages and International Studies, Vietnam National University was reported to be within the acceptable range, at .70 (Vu, 2014). In addition, the English achievement test for third-year students at a university in Ho Chi Minh City possessed poor internal reliability, at .53, which is undesirable even for low-stakes tests.
Reliability is crucial to rating and analyzing test scores, as reliable scores supply language testers with trustworthy information regarding the targeted skill and ability. To improve reliability, language testers should mitigate the impact of inconsistency of measurement through test design and development, rather than attempting to abandon such measures entirely. Moreover, measuring a limited range of constructs is strongly recommended for enhancing test reliability (Bachman & Palmer, 1996). Hughes (2003) also suggests that an ample number of items, clear instructions, well-designed items, an interpretable scoring key and a familiar test format strengthen the quality of reliability.
It is apparent that most of the published studies regarding test quality in Vietnam were limited to low-stakes testing practices at several specific universities. Some papers have addressed VNHSEE, but with the old test format. Accordingly, more research into the quality of Vietnam's high-stakes tests in general, and of VNHSEE 2017 with its revised test format in particular, is essential.
2.3.2.2 Item Analysis and Bachman & Palmer’s Qualities of Test Usefulness
Framework (1996)
Considering Bachman and Palmer's Qualities of Test Usefulness Framework (1996), item analysis fits into the quality of reliability. When language testers want to improve the reliability of a given test (technically, its dependability for a CRT), they can increase the number of items, improve the quality of existing items, or both. Item quality, in turn, is measured by item discrimination and, to a lesser extent, difficulty. With respect to discrimination, test reliability can be linked to discriminability on the basis of the correlational approach. Besides, in the pursuit of well-written test questions, item writers have to construct items capable of assessing examinees' higher-order cognitive processes of content interpretation, analysis and application. Therefore, an investigation of the quality of test questions, or an item analysis, establishes item difficulty and discrimination indices for every test item for the development and revision of overall reliability. Downing (2005) shows that test reliability improves by 10–25% after all items reported as 'technically flawed' in an item analysis of multiple choice questions are eliminated from a given test.
As mentioned above, item analysis in CTT addresses two major concepts, item difficulty and discrimination; however, it is conducted with different approaches in norm-referenced tests (NRTs) and criterion-referenced tests (CRTs). Given the dual classification of VNHSEE 2017's test type, both NRT and CRT item analyses are discussed.
2.3.3 Norm – referenced Testing Item Analysis
2.3.3.1 Item Difficulty
Item difficulty/facility (IF) reflects the proportion of students answering an item correctly (Carr, 2011). IF technically pertains to the easiness of an item, since the higher the index, the easier the item. Values range between .00, where no test taker responds correctly, and 1.00, where all of the examinees choose the correct answer. For dichotomous items, scored with two values (0 for a wrong answer and 1 for a correct answer), IF is the average of individual scores on that item, expressed as follows:
IF_j = n_j / n
where n_j represents the number of examinees with a correct response to Item j and n is the total number of test takers.
NRTs aim to distinguish between test takers of high and low ability and to place them on a bell curve. Accordingly, an IF of .50 is ideal in NRTs as it provides information regarding the largest group of examinees – those in the middle of the distribution (Carr, 2011). However, in real testing contexts, an item with a difficulty index from .25 to .75 (De Champlain, 2010) or from .30 to .70 (Allen & Yen, 1979) is deemed desirable for identifying students' various levels of proficiency. An item whose IF is above or below this range is considered too easy or too difficult, respectively. However, Carr (2011) cautions that excessive attention to IF is probably improper in both CRTs and NRTs.
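A minimal Python sketch of the IF computation for dichotomous items is given below; the response matrix is a toy example, not data from the pilot test.

```python
import numpy as np

# Rows = examinees, columns = items; 1 = correct, 0 = incorrect (toy data).
responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
])

item_facility = responses.mean(axis=0)    # IF_j = proportion answering item j correctly
print(item_facility)                      # -> 0.75, 0.50, 0.75, 1.00

# Flag items outside the .30-.70 band suggested by Allen & Yen (1979) for NRTs.
flagged = [j for j, p in enumerate(item_facility, start=1) if not 0.30 <= p <= 0.70]
print("Items outside .30-.70:", flagged)
```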
2.3.3.2 Item Discrimination
Item discrimination indicates the extent to which an item distinguishes between test takers of high and low ability. In NRTs, this index refers to the correlation between a student's performance on an item and his or her overall performance on the test. Despite their similar interpretations, there are two methods of estimating NRT discrimination values: the subtractive and the correlational approach.
With the subtractive approach, item discriminability is calculated as the difference between the IF of a high-scoring group and that of a low-scoring group. Typical recommendations for the size of the upper and lower groups on the overall test are the top and bottom 20%, 25%, 27% or one third. Testing experts sometimes try values between 20% and 33% to obtain an even number of examinees. However, whilst groups that are too large blur the difference in ability between students with high and low performance, groups that are too small inflate the discrimination indices. The subtractive NRT discrimination value, abbreviated ID_UL, is calculated with the formula:
ID_UL = IF_upper − IF_lower
The correlational approach, which is employed in this paper, produces discrimination values that illustrate the 'correlation between the item and the total test score' (McAlpine, 2002). The most frequently used correlational discrimination index for NRTs is the point-biserial (r_pbis, r(pb) or pb(r)). pb(r) is mathematically a Pearsonian correlation coefficient and is computed with Brown's (1988) formula:
pb(r) = [(M_p − M_q) / S_x] √(pq)
where pb(r) represents the point-biserial correlation coefficient, M_p is the whole-test mean for examinees who answered the item correctly, M_q is the whole-test mean for examinees who answered the item incorrectly, and S_x is the standard deviation for the whole test. The standard deviation expresses how far scores differ from the mean; S_x is computed as the square root of the sample variance. p is the proportion of examinees with correct answers and q is the proportion of examinees with incorrect answers.
Unlike item difficulty, the discrimination index ranges from -1.00 to 1.00. pb(r) cannot exceed 1.00 (perfect discrimination), but the closer to that value, the better the item (Carr, 2011). Carr (2011) provides the following guideline for interpreting the discrimination index:
• If .30 ≤ pb(r) ≤ .39, little to no item revision is needed.
• If .20 ≤ pb(r) ≤ .29, item revision is necessary.
• If pb(r) ≤ .19, the item should be completely revisited or eliminated.
A common misconception when interpreting discrimination statistics is to apply the IF interpretation guideline to pb(r). This confusion occurs when language testers and test developers focus on an item with a seemingly too-high index as if it violated the rule of thumb: items can be too difficult or too easy, but discrimination is never too high. However, interrelated issues arise when a discrimination value is near or below zero. On the one hand, if the index from the correlational approach (pb(r)) or the subtractive approach (upper–lower discrimination) is positive but close to zero, or marginally negative, the item does not serve the purpose of increasing the reliability of the test (Carr, 2011). On the other hand, a large negative discrimination value marks the item as actively harmful to test reliability, as the statistic indicates that high-scoring test takers did more poorly on the item than low-scoring students. In this case, test developers may need to check whether the scoring key is correct, whether the item has two keys, or whether the item is tricky or ambiguous in some way (Carr, 2011).
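The point-biserial can also be obtained as an ordinary Pearson correlation between the dichotomous item scores and the total scores, which is equivalent to Brown's formula above (up to the variance formula used). The sketch below applies Carr's (2011) cut-offs to a toy response matrix and is illustrative only.

```python
import numpy as np

def point_biserial(item_scores: np.ndarray, total_scores: np.ndarray) -> float:
    """Pearson correlation between a 0/1 item and the total test score."""
    return float(np.corrcoef(item_scores, total_scores)[0, 1])

responses = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
])
totals = responses.sum(axis=1)

for j in range(responses.shape[1]):
    r_pb = point_biserial(responses[:, j], totals)
    note = "little/no revision" if r_pb >= 0.30 else "revise or discard"
    print(f"Item {j + 1}: pb(r) = {r_pb:.2f} -> {note}")
```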
2.3.3.3 Reliability
Reliability refers to the consistency of scoring in NRTs. As mentioned in the overview of CTT, an examinee's received (observed) score consists of a true score and measurement error. Accordingly, reliability estimates how much of the observed score is attributable to the true score, or to the examinee's ability, rather than to the error component (Carr, 2011).
Unlike test–retest reliability and parallel-forms reliability, internal consistency reliability is applied in this paper because it requires only a single test administration. This approach measures whether different sections of the test assess the same thing (the Spearman–Brown prophecy formula for split-half reliability) or how consistently the items measure the target knowledge (Cronbach's alpha (α) for internal consistency reliability). Cronbach's alpha is estimated with the following formula:
α = [k / (k − 1)] (1 − Σs_i² / s_x²)
where k is the number of items on the test, s_i² is the population variance for an individual item on the test, Σs_i² is the sum of all these item variances, and s_x² is the population variance of the total test score. As a rule of thumb, reliability in a high-stakes test should exceed 0.80. Even for a low-stakes test, for instance in classroom practice, a reliability coefficient of 0.70 or lower is undesirable since at least 30% of a student's score then results from measurement error (Carr, 2011).
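The alpha formula above translates directly into a few lines of Python; the 0/1 response matrix below is invented purely to demonstrate the computation.

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=0)        # population formula, per item
    total_variance = responses.sum(axis=1).var(ddof=0)    # population variance of totals
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
])
print(round(cronbach_alpha(responses), 3))   # about 0.60 for this toy matrix
```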
Besides the overall consistency of scoring, the influence of measurement error on test takers' performance also needs to be examined. Measurement error, as a component of an examinee's observed score, creates the difference between his or her received score and true score. Accordingly, the more reliable the test, the closer these two scores become. The standard error of measurement (SEM) is estimated to provide test developers with a confidence interval. To be more specific, they can be 68% confident that a test taker's true score lies within one SEM of the observed score (+/- 1 SEM), 95% sure that the true score lies within two SEMs of the observed score (+/- 2 SEMs) and over 99% certain that the two scores are no more than 3 SEMs apart. SEM is expressed in the formula:
SEM = s_x √(1 − α)
where s_x is the standard deviation of total test scores and α is Cronbach's alpha for internal consistency reliability.
2.3.4 Criterion – referenced Testing Item Analysis
Although item analyses for both norm-referenced tests (NRTs) and criterion-referenced tests (CRTs) examine item difficulty and discrimination, their different natures and purposes result in different indices and interpretation guidelines.
2.3.4.1 Item Difficulty
In CRTs, due to its difficult estimation in practice, IF is reported as an additional index used to interpret other item analysis results, not as a determinant in selecting, reviewing or eliminating an item. Owing to the distinctive types and purposes of CRTs, there is no precise interpretation guideline for the IF index.
2.3.4.2 Item Discrimination
As mentioned above, item discrimination in NRTs represents the relationship between test takers' performance on an item and their overall test scores. However, discrimination values for CRTs indicate the correlation between item performance and mastery/non-mastery at one or more cut scores (Carr, 2011). In the case of multiple cut scores, discrimination indices must be computed separately for each cut score, because different cut-off lines alter the standard for masters and non-masters. For CRTs, item discrimination is also calculated with two approaches: correlational and subtractive.
Regarding the correlational approach, as long as the absolute decisions are dichotomous (pass/fail, master/non-master, etc.) and the test items are also dichotomous (0 for an incorrect response and 1 for a correct response), the φ coefficient, or item φ (Glass & Stanley, 1970), is estimated and analyzed as the correlational discrimination index for CRTs. Item φ is a Pearson correlation and can be expressed in the following formula (Brown & Hudson, 2002):
φ = (P_ip − P_i P_p) / √(P_i Q_i P_p Q_p)
where P_i represents the proportion of examinees with a correct answer to the item, Q_i is the proportion of examinees with an incorrect answer to the item (Q_i = 1 − P_i), P_p is the proportion of examinees who were above the cut score, Q_p is the proportion of examinees who were below the cut score (Q_p = 1 − P_p), and P_ip is the proportion of examinees who answered the item correctly and were above the cut score. Carr (2011) recommends that an item with φ ≥ 0.30 be deemed desirable.
There are two categories of subtractive discrimination values for separating masters and non-masters in CRTs: the difference index (DI) and the B-index. Because the DI requires two test administrations, this paper applies the B-index, which needs only a single test delivery. The position of a test taker's total test score relative to the cut score(s) determines the mastery and non-mastery groups. As a result, if more than one cut score is predetermined, the CRT discrimination index needs to be separately estimated for each, with the following convention:
B-index = IF_masters − IF_non-masters
where IF_masters and IF_non-masters represent the item facility of masters and non-masters, respectively. Carr (2011) provides a B-index interpretation guideline, which suggests that the B-index of the subtractive approach should be ≥ 0.40.
In real testing contexts, only one of these two approaches is typically applied and thus only one of the two indices is reported. However, in some cases, as in this study, both item φ and the B-index are estimated to examine whether they provide conflicting or consonant information regarding the items. One essential point is that an item with a desirable discrimination value at one cut score does not necessarily perform as efficiently at other cut scores (Carr, 2011).
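Both CRT discrimination indices can be computed from a scored response matrix and a chosen cut score, as in the sketch below; the matrix and the cut score are assumed for illustration only.

```python
import numpy as np

def crt_discrimination(responses: np.ndarray, cut_score: int):
    """B-index (IF of masters minus IF of non-masters) and item phi at one cut score."""
    totals = responses.sum(axis=1)
    masters = totals >= cut_score                 # mastery defined by the cut score
    results = []
    for j in range(responses.shape[1]):
        item = responses[:, j]
        b_index = item[masters].mean() - item[~masters].mean()
        item_phi = float(np.corrcoef(item, masters.astype(int))[0, 1])
        results.append((j + 1, round(float(b_index), 2), round(item_phi, 2)))
    return results

responses = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
])
for item_no, b, phi in crt_discrimination(responses, cut_score=3):
    print(f"Item {item_no}: B-index = {b}, item phi = {phi}")
```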
2.3.4.3 Dependability
Dependability for CRTs is technically equivalent to the term reliability for NRTs. However, dependability is separated into consistency of scoring and consistency of classification.
In terms of consistency of scoring, this coefficient resembles reliability for NRTs but differs in its influence on the score-interpreting and decision-making process. Relative decisions in NRTs stress ranking students and comparing each test taker's performance to others'. Absolute decisions, by contrast, focus on how well each examinee performs and how much of his or her ability is indicated by the test, without reference to other test takers. Carr (2011) highlights that a test's dependability is often at least marginally lower than its reliability. For dichotomous data, the dependability index for consistency of scoring is calculated with the formula:
Φ = [n s² α / (n − 1)] / [n s² α / (n − 1) + (M(1 − M) − s²) / (k − 1)]
where n represents the number of test takers, s² is the variance of test scores in the proportion-correct metric (computed with the population formula), α is the CRT test reliability coefficient, M is the mean of the proportion scores and k is the number of items on the test.
To provide a confidence interval analogous to the SEM in NRTs, the CRT confidence interval, abbreviated CI_CRT, is expressed in Brown & Hudson's (2002) convention, using the same statistics as in the Φ formula:
CI_CRT = √[(M(1 − M) − s²) / (k − 1)]
Although CI_CRT is interpreted comparably to the SEM, with 68% and 95% confidence intervals around test takers' observed scores, its implication differs with regard to the ultimate decisions made on test takers' scores (relative or absolute decisions).
Regarding consistency of classification, CRTs serve the elementary purpose of classifying students as masters and non-masters, that is, as being above or below certain predetermined cut-off scores. Classification consistency in curriculum-related testing is categorized and estimated with squared-error loss agreement indices. Φ(λ) (Brennan, 1992) – the most popular index – is computed with the following formula:
Φ(λ) = 1 − [1 / (k − 1)] × [M(1 − M) − s²] / [(M − λ)² + s²]
where M, s² and k are the same statistics as in the Φ formula and λ represents the cut score. Given that the cut score (λ) is included in the formula, the Φ(λ) index is cut-score dependent. Accordingly, similar to the CRT item discrimination B-index, if a test has more than one cut score, Φ(λ) must also be separately calculated for each cut score. Carr (2011) emphasizes that the further the cut score lies from the mean, the higher the degree of classification consistency.
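A hedged sketch of how both dependability coefficients could be estimated from a single administration is given below, assuming the short-cut formulas stated above and a purely invented response matrix; the cut score of 0.1 mirrors the 1-out-of-10 graduation cut score in proportion metric.

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    k = responses.shape[1]
    return (k / (k - 1)) * (1 - responses.var(axis=0, ddof=0).sum()
                            / responses.sum(axis=1).var(ddof=0))

def dependability(responses: np.ndarray, cut_prop: float):
    """Phi (consistency of scoring) and Phi(lambda) at a cut score in proportion metric."""
    n, k = responses.shape
    prop = responses.mean(axis=1)                # proportion-correct score per examinee
    M, s2 = prop.mean(), prop.var(ddof=0)        # mean and population variance, as in the text
    alpha = cronbach_alpha(responses)
    universe = (n / (n - 1)) * s2 * alpha        # estimated universe (true) score variance
    abs_error = (M * (1 - M) - s2) / (k - 1)     # estimated absolute error variance
    phi = universe / (universe + abs_error)
    phi_lambda = 1 - (1 / (k - 1)) * (M * (1 - M) - s2) / ((M - cut_prop) ** 2 + s2)
    return phi, phi_lambda

rng = np.random.default_rng(7)
ability = rng.uniform(0.2, 0.9, size=60)                          # hypothetical examinees
responses = (rng.uniform(size=(60, 50)) < ability[:, None]).astype(int)
phi, phi_lam = dependability(responses, cut_prop=0.1)
print(round(phi, 2), round(phi_lam, 2))
```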
2.3.5 Distractor Analysis
Distractor analysis is the procedure of statistically estimating and evaluating how effectively the alternatives in a selected response task perform, so that test developers can revise problematic options. This paper examines two indices to identify which options are candidates for review: distractor attractiveness (Bachman, 2004) and the option point biserial.
To scrutinize the level of attractiveness of each foil, Bachman (2004) suggests, as a rule of thumb, that a minimum of 10% of the population should select each distractor. If any distractor is so unattractive or implausible that no or almost no test takers choose it, that distractor does not serve any purpose in discriminating students and should be a top priority for revision (Carr, 2011). It is important to bear in mind that the response frequency of each option is computed in the same way as item facility (IF) and is then converted from a proportion to a percentage.
Similar to the item discrimination index, the option point biserial is calculated to examine how well each distractor discriminates among test takers. This is done by treating each distractor in turn as the key and correlating that scoring with the overall test score. A distractor should possess a negative point biserial, which indicates that test takers who choose that wrong option are likely to obtain lower total scores, and vice versa (Carr, 2011). Any distractor with a discrimination value close to zero should also be revised, because it attracts the same – or almost the same – number of test takers with high and low total test scores; in other words, that distractor does not assist in separating examinees of different levels. A distractor with a positive point biserial coefficient (more than .05) is also considered a candidate for revision, as it appears more plausible to high-scoring students than to low-scoring ones (Carr, 2011).
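A small Python sketch of this distractor analysis is shown below; the option letters, choices and total scores are fabricated solely to demonstrate the two indices (response frequency and option point biserial).

```python
import numpy as np

def distractor_analysis(choices: np.ndarray, totals: np.ndarray, options=("A", "B", "C", "D")):
    """Response frequency and option point biserial for every option of one item."""
    report = {}
    for opt in options:
        picked = (choices == opt).astype(int)          # score the option as if it were the key
        freq = picked.mean()
        r_pb = float(np.corrcoef(picked, totals)[0, 1]) if picked.std() > 0 else float("nan")
        report[opt] = (round(float(freq), 2), round(r_pb, 2))
    return report

# Toy data for one item: the option each examinee chose and their total test scores.
choices = np.array(["A", "B", "A", "C", "A", "D", "B", "A", "C", "A"])
totals = np.array([42, 20, 39, 25, 44, 18, 23, 40, 27, 45])
print(distractor_analysis(choices, totals))
# Distractors chosen by under 10% of examinees, or with point biserials near zero
# or above +.05, would be flagged as candidates for revision.
```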
In conclusion, the aforementioned computations and analyses are of equal importance in thoroughly examining the quality of test items as well as test reliability. However, in the context of Vietnam, the qualities of language tests have not yet received the attention they deserve. The limited number of previous studies addressing this issue, especially for VNHSEE, are mentioned in 1.2 and 2.3.2.2. Some of them did not accomplish all the analyses or were conducted on a limited scale; others have not been made available for public scrutiny. Furthermore, few papers on reliability and the quality of test questions have been conducted, irrespective of their significance and nationwide impact. These research gaps thus catalyze this thesis to investigate the quality of the test items and the test reliability of Sample VNHSEE 2017 – the test of topmost importance in Vietnam.
CHAPTER 3: RESEARCH METHODOLOGY
3.1 Research Setting
This research was conducted at the time Sample VNHSEE 2017 had just been published. This was the first time MOET had adopted the amended test format, which drew significant attention from the public, especially test-takers and educators.
3.2 Participants
Proportional and simple random sampling methods of the probability sampling strategy were applied to select samples from 475 grade-12 students at Hanoi – Amsterdam High School. Among the total population, regarding English proficiency, there were 75 masters from the two English-specialized classes and 400 non-masters from eleven other classes with different specializations (28 in Chinese, 24 in Russian, 55 in French, 61 in Math, 58 in Physics, 50 in Chemistry, 29 in Computer Science, 29 in Biology, 26 in Literature, 21 in History and 19 in Geography). The quota sampling method determined that the sample group would consist of 32 masters and 168 non-masters, with the number in each stratum proportional to the real population (32:12:10:23:26:24:21:12:12:11:9:8 = 75:28:24:55:61:58:50:29:29:26:21:19), as illustrated in the sketch below. For each stratum, simple random sampling was accomplished in three steps. First, the Sample VNHSEE 2017 was delivered to the entire target population; one Yes/No question was added at the beginning of the test paper to survey whether the students had completed the test before, since the test had been published for over a month by the day the data collection procedure was carried out. Subsequently, answer sheets showing signs of examination offences (cheating, using a dictionary or discussing with other candidates, etc.) or with the answer 'Yes' to the supplementary survey question were excluded. Finally, random subsamples were selected from the remaining answer sheets to form two subgroups of masters and non-masters comparable to the proportions in the real population; these constitute the data for this paper.
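For illustration, the proportional allocation of the 200-student sample across the twelve classes can be reproduced with a few lines of Python; the class sizes are taken from the description above, while the seed and the drawing of student indices are purely hypothetical.

```python
import random

strata = {"English": 75, "Chinese": 28, "Russian": 24, "French": 55, "Math": 61,
          "Physics": 58, "Chemistry": 50, "Computer Science": 29, "Biology": 29,
          "Literature": 26, "History": 21, "Geography": 19}
total_population = sum(strata.values())      # 475 grade-12 students
sample_size = 200

# Proportional quota per stratum, rounded to whole students (sums to 200 here).
quotas = {name: round(size * sample_size / total_population) for name, size in strata.items()}
print(quotas)                                # English -> 32, Math -> 26, Geography -> 8, ...

# Simple random draw of student indices within each stratum.
random.seed(1)
sampled = {name: sorted(random.sample(range(size), quotas[name])) for name, size in strata.items()}
```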
3.3 Data Collection
3.3.1 Data Collection Instrument
A total of fifty MCQs from the Sample VNHSEE 2017 published by MOET were utilized in this research. They are all four-option MCQs, covering four featured areas: Phonetics (4 questions), Vocabulary (14 questions), Grammar (12 questions) and Reading Comprehension (20 questions) (Appendix 1).
In order to assure the accuracy of the data collection and analysis process, a Yes/No survey question was added to the test paper to certify that the sample group had not been exposed to the sample material before the mock test.
3.3.2 Data Collection Procedure
The sample test was delivered to the chosen samples during their English class period (45 minutes), with one 15-minute break if necessary, under the observation of the researcher and their former teachers. More importantly, the researcher stated the research purposes and clarified that the results of this test did not affect their academic