
DOCUMENT INFORMATION

Title: An investigation into the quality of the final test for second-year students at a technical university in Hanoi
Author: Nguyễn Thị Hoài Anh
Institution: Vietnam National University Hanoi
Field: English Teaching Methodology
Document type: Thesis
Year of publication: 2022
City: Hanoi
Number of pages: 28
File size: 450.02 KB



VIETNAM NATIONAL UNIVERSITY HANOI

UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES

FACULTY OF POST-GRADUATE STUDIES

NGUYỄN THỊ HOÀI ANH

AN INVESTIGATION INTO THE QUALITY OF THE FINAL TEST FOR SECOND-YEAR STUDENTS AT A TECHNICAL UNIVERSITY IN HANOI

ĐÁNH GIÁ BÀI KIỂM TRA CUỐI KÌ CỦA SINH VIÊN NĂM HAI TẠI MỘT TRƯỜNG ĐẠI HỌC KĨ THUẬT Ở HÀ NỘI

M.A. MINOR PROGRAMME THESIS

Field: English Teaching Methodology
Code: 8140231.01

HANOI - 2022


ABSTRACT

In the reform of English education and assessment in response to the context of global integration, the English course for Mechanical Engineering students at a technical university in Hanoi (IU) was developed, and its final test is used to assess students' performance. This study explored the reliability and content validity of the final test of the course for second-year students via qualitative and quantitative methods. The reliability index is calculated based on the Kuder-Richardson Formula 20. Meanwhile, the content analysis is based on (i) Douglas's (2000) framework, to evaluate the relevance and the coverage of the test content compared with the description in the test specification, and (ii) test score analysis, which showed a fairly high consistency of the test content with the framework of the test design and the test takers' performance. The findings help confirm the reliability and content validity of the specific investigated test paper. However, a need for content review is raised by the research, as some problems have been revealed by the analysis.


CHAPTER 1: INTRODUCTION

1.1 Rationale

English has repeatedly been proved to be a key to success in most fields in Vietnam nowadays. Most companies and organizations include English in their employment requirements. However, according to the Ministry of Labor, the General Statistics Office of Vietnam & the International Labor Organization in 2014 and NESO (2010), many university graduates cannot find a job relevant to their discipline area, and more than half of graduates need to be retrained after being recruited. This has created considerable concerns over education quality, including English skills.

In this study, the author focuses on examining the quality of a final achievement test for the following reasons. First, the achievement test helps teachers form some idea of the amount of language that each student attains in a given period of time, with very specific reference to a particular program. The result of the final achievement test has a decisive role in the result of the subject in each semester; therefore, it must be designed in direct link with the program and its objectives. Moreover, the development of systematic achievement tests is crucial to the evolution of a systematic curriculum. Therefore, the final achievement test plays an important role in the evaluation of an English course. Besides, among the underlying qualities of a test, reliability and validity are the two most vital ones that constitute a good test (Bachman, 1990). According to Bachman (1990), while reliability is a quality of test scores themselves, validity is a quality of test interpretation and use.

At IU (a pseudonym for the university in which this study takes place), the course "English for Occupational Purposes" (EOP) has been offered since 2009 to several student generations in several majors (English for Tourism, English for Business) and has proved effective: employers highly appreciate employees who are IU graduates. It was not until 2017 that the project for the course "English for Mechanical Engineering" was established.

Though the course has proved to be effective, there has been no official evaluation by researchers and lecturers to examine the quality of the final achievement tests for students learning "English for Mechanical Engineering" – Book 4 in terms of reliability and content validity. Therefore, it is essential to conduct a study on the evaluation of the final achievement test.


For the above-mentioned reasons, I have decided to choose the research topic "An investigation into the quality of the final test for second-year students at a technical university in Hanoi", with the intention that the study will be helpful to the author, the teachers, the test takers, and anyone who is concerned with language testing in general and with the reliability and validity of an English for Specific Purposes (ESP) achievement test in particular.

1.2 Aims and objectives of the study

The major aim of the study is to evaluate the reliability and content validity of the currently used final achievement test for the second-year students at IU. The specific objectives of the research are:

- to investigate the compatibility, in terms of content, of the test specification and the test content of the final achievement test, as evidence for the content validity of the test;

- to find out the level of reliability and the statistical evidence for the test's content validity from the results of the final achievement test;

- to provide some suggestions for improving the test in terms of its reliability and validity.

1.3 Research questions

The study was designed to answer the three following questions:

Research question 1: To what extent is the content of the final achievement test compatible with the test specification?

Research question 1.1: To what extent is the content of the vocabulary section compatible with the test specification?

Research question 1.2: To what extent is the content of the grammar section compatible with the test specification?

Research question 1.3: To what extent is the content of the reading section compatible with the test specification?

Research question 2: To what extent do the final achievement test results reflect its validity?

Research question 2.1: To what extent do the test results reflect the planned test difficulty to the students?


Research question 2.2: To what extent do the test results reflect the planned test discrimination?

Research question 3: To what extent do the final achievement test results reflect its reliability?

1.4 Scope of the study

Due to limitations in time, ability and conditions, as well as the author's own interest, this research paper only focuses on evaluating the vocabulary, grammar, and reading sections of the current written final achievement test of the fourth semester for Mechanical Engineering students at a technical university in Hanoi.

In fact, the three sections evaluated share the same type of item response (dichotomous items), so it is easier for the author to use the same data collection and analysis tools for all of them.

Besides, experience of assessment practice at IU has helped the author realize that the vocabulary, grammar and reading sections might be the most problematic sections that can affect the reliability and validity of the test.

In addition, the study will focus on internal reliability and content validity – the two most fundamental qualities of a test.

1.5 Significance of the study

First and foremost, the paper will provide English teachers with feedback about the test content, from which they can identify the weaknesses and strengths of the test. From that, they will know how to design tests more properly.

Secondly, identifying the "good" and "not so good" points of the test will help improve English test design, test specifications and course books more efficiently. Indeed, when there is a match between the course books and the tests, teaching and learning quality improves and students achieve higher results in their language learning.

Finally, it will assist teachers and test designers in gaining a deep insight into evaluating the internal reliability and content validity of tests. From that, it can serve as reference information for other researchers who are engaged in designing a test proper to the course objectives and students' levels and interests.


CHAPTER 2: LITERATURE REVIEW

2.1 Language tests

2.1.1 Definition of tests

Bachman (1990, p.20) defines the term "test" as "a measurement instrument designed to elicit a specific sample of an individual's behavior". This definition provides a general basis for understanding tests.

For the context of IU, where the test is specially designed for Mechanical Engineering students, Douglas's conclusion seems to match up perfectly. Therefore, in this study, an ESP test is defined as a test which has a certain level of authenticity, allowing an interaction between the test taker's language ability and specific purpose content knowledge, in order to make inferences about the test taker's capacity to use language in the specific purpose domain.

In order to have good tests, besides clearly identifying the purpose and type of the test, designing good test specifications plays a vital role. In the next subsection, the main ideas of test specifications will be thoroughly discussed.

Generally, specs usually have two key elements: sample(s) of the items or tasks they intend to create, and guiding language – everything else. However, it is said that the form of the spec is up to its users, and both form and content evolve in a creative, organic, consensus-driven, iterative process (Fulcher and Davidson, 2007). Douglas (2000) also notes that specs are dynamic, changing due to feedback from members of the test development team, from teachers who may be consulted at various points, from subject specialist informants, and from experience gained in trialing, or piloting, the test tasks. However, in order to generate good tests, it is necessary to clarify test task characteristics. Bachman (1996) proposes a framework of task characteristics including five main parts: characteristics of the rubric, characteristics of the input, characteristics of the expected response, characteristics of the interaction between input and response, and characteristics of assessment.

Douglas (2000) inherits Bachman's framework and suggests a framework for ESP test task specifications. Generally, Douglas's (2000) framework (i) adds some ideas on the level of authenticity to the characteristics of the input and the expected response, and (ii) rearranges some ideas in the subparts.

The highlight of Douglas's (2000) framework is that it directly serves the purpose of developing ESP tests, which means its context is more suitable for the pilot test used in this research than that of Bachman (1996).

However, the limitation of Douglas's framework is that it lacks the language characteristics in the part "characteristics of the input", which Bachman (1996) includes in that same subpart. This is understandable because one characteristic of ESP is that it is strenuous, though possible, to align the levels (Athanasiou et al., 2016).

At IU, the school uses the framework of Douglas (2000) with some adaptation of the input and response characteristics, which are specified as follows:

Characteristics of the input: Length; Language of input; Domain; Text level
Characteristics of the response: Response type; Skills/Language construct

However, the text-level part will be applied to the reading section only, because the vocabulary and grammar sections contain short items with a word count of only 15-25 words per item.


In this study, the adapted framework of Douglas will be exploited to evaluate the compatibility of the test specification and the test content in the pilot test, with a focus on the analysis of the characteristics of the input and the response.

2.2 Major characteristics of a good test

2.2.1 Qualities of a good test

2.2.1.1 Qualities of a good English for General Purpose (EGP) test

EGP is the language that is used every day for ordinary things in a variety of common situations (Delgrego, 2009). Generally, there are various criteria for a good EGP language test presented by different scholars.

Bachman and Palmer (1996) suggest six criteria as qualities of test usefulness rather than individual factors. The idea of usefulness can be expressed as in Figure 1 below.

Usefulness = Reliability + Construct validity + Authenticity + Interactiveness + Impact + Practicality

Figure 1: Test usefulness (Bachman and Palmer, 1996)

Other leading scholars in testing share similar ideas about test characteristics with the two scholars mentioned above. Among these and other test characteristics, it is observable that reliability and validity are consistently present, are essential to the interpretation and use of measures of language abilities, and are the primary qualities to be considered in developing and using tests (Bachman and Palmer, 1996). For this reason, in this study, the author would like to examine these essential measurement qualities of the test taken by a large number of second-year Mechanical Engineering students at IU.

2.2.1.2 Qualities of a good EOP test

English for Occupational Purposes (EOP) is a branch of ESP (English for Specific Purposes), and the aim of an EOP course is to meet the occupational English language needs of learners in their occupational settings. EOP is used both in the university context, in which students can take the course to prepare for their future jobs, and in the business context (Dudley-Evans and St John, 1998).

To evaluate an EOP or ESP test, Douglas (2000) mentions six qualities of good testing practice, which are stated to be based heavily on the work of Bachman and Palmer (1996), Bachman et al. (1991), and Davidson and Lynch (1993): reliability, validity, situational authenticity, interactional authenticity, impact, and practicality. The author elaborates the six qualities as follows: (i) validity concerns the accuracy of the interpretations that are made of test performance; (ii) reliability is considered the consistency and accuracy of the measurements; (iii) the relationship between the target situation and the test tasks is understood as situational authenticity; (iv) interactional authenticity means the engagement of the test takers' communicative language ability; (v) impact is seen as the influence the test has on learners, teachers, and educational systems; (vi) the constraints imposed by such factors as money, time, personnel, and educational policies belong to the quality of practicality.

Although this research only focuses on reliability and content validity, the test qualities proposed by Douglas (2000) seem more suitable to the context of the research. Therefore, in this study, the author uses Douglas's (2000) definitions of test qualities, especially those of reliability and content validity.

2.2.2 Test Reliability

Different authors have defined reliability differently over the years.

In research from the Council of Europe (2001), reliability is defined as a technical term, which is "basically the extent to which the same rank order of candidates is replicated in two separate (real or simulated) administrations of the same assessment".

Reliability is concerned with a number of aspects in need of exploration. In the subsections that follow, three issues will be discussed: (i) factors affecting language test scores; (ii) types of reliability; (iii) examining test reliability.

2.2.2.1 Factors affecting test scores

Shohamy (1985) states that "No measure is perfect". Even if raters are familiar with scoring scales, they are still affected by many other factors. Five years later, Bachman (1990) clarified those factors, which are language ability, test method facets, personal attributes and random factors.

Obviously, language test scores can be affected, first of all, by the test taker's language ability. According to Bachman (1990), the three other factors can be divided into two kinds: systematic factors (test method facets and personal attributes), and unsystematic, unpredictable factors (random factors).

2.2.2.2 Types of reliability


Johnson & Christensen (2019) classify four types of reliability. The two most relevant here can be outlined as follows:

Type of reliability     Number of testing sessions   Number of test forms   Statistical procedure
Equivalent-forms        1 or 2                       2                      Correlation coefficient
Internal consistency    1                            1                      Coefficient alpha, or correlation coefficient

In this study, the author aims to use internal reliability. According to Korb (2017), if the test has dichotomous items (e.g., right-wrong answers), the Kuder-Richardson 20 (KR-20) formula is the best accepted statistic:

KR-20 = [K / (K − 1)] × [1 − (Σpq) / s²]

in which:

K: the number of test items
p: the proportion of the examinees who got an item correct
q: the proportion of the examinees who got that item incorrect (q = 1 − p)
s²: the variance of the total test scores
Σpq: the summation of the product of p and q over all items

The KR-20 is used for items that have varying difficulty (some items might be very easy, others more challenging). Based on this characteristic of the evaluated test – it consists of dichotomous items with right-wrong answers – the Kuder-Richardson 20 formula will be used for research question 3.
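As a minimal illustrative sketch (hypothetical data, not the thesis dataset), the KR-20 computation described above can be carried out on a matrix of dichotomous (0/1) item responses:

```python
# Sketch: computing KR-20 = [K/(K-1)] * [1 - (sum of pq)/s^2] from a matrix of
# dichotomous (0/1) responses. The data below is hypothetical, for illustration.

def kr20(responses):
    """responses: one list per examinee, with a 0/1 entry per item."""
    n = len(responses)
    k = len(responses[0])                        # K: number of test items
    totals = [sum(row) for row in responses]     # each examinee's total score
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n  # s^2 (population variance)
    sum_pq = 0.0
    for item in range(k):
        p = sum(row[item] for row in responses) / n      # proportion correct
        sum_pq += p * (1 - p)                            # p * q for this item
    return (k / (k - 1)) * (1 - sum_pq / variance)

# Hypothetical responses: 4 examinees x 5 items
data = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
]
print(kr20(data))
```

A value close to 1 would indicate high internal consistency of the dichotomous items; in practice the computation is run over all examinees' answer sheets for the test paper under evaluation.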

2.2.3 Test Validity

2.2.3.1 Types of validity

Bachman (1990) claims that validity is the most important quality of test interpretation or test use. Test scores are the key factor in examining qualities related to the validity of a test, along with the teaching syllabus, the test specification and other factors. In fact, there are a variety of perspectives on the concept of validity, which means this most crucial quality of a test has been categorized differently.

Content validity

Test users generally tend to examine the test content from a copy of the test and/or the test design guidelines; that is to say, test specifications and example items are inspected. Also, test developers usually focus on the content or ability domain covered in the test, from which test tasks/items are generated. It seems that consideration of the test content plays a significant role for both test users and test developers. Commenting on content validity, Bachman (1990: 244) suggests that demonstrating test relevance and coverage is a necessary part of validation. Evaluating content validity means finding out whether the content of the test is "sufficiently representative and comprehensive for the test to be a valid measure of what it is supposed to measure" (Henning, 2001: 91).

There are two aspects of content validity mentioned in Bachman (1990), namely content relevance and content coverage. He notes that content relevance should be considered in the specification of the ability domain – the constructs to be tested – and the test method facets – aspects of the whole testing procedure. This is directly linked with the test design process, to see whether the items generated for the test reflect the constructs to be measured and the nature of the responses that the test taker is expected to make. The limitation of content validity is that it does not take into account the actual performance of test takers (Cronbach, 1971; Bachman, 1990). It is an essential part of the validation process, but it is not sufficient all by itself, as inferences about examinees' abilities cannot be made from it alone.

2.2.3.2 Examining the test content validity

Messick (1980) points out that content validity, along with criterion validity, is considered part of construct validity in the view of a "unifying concept".

To ensure scoring validity, which is considered "the superordinate for all the aspects of reliability" (Weir, 2005: 22), test administrators and developers need to see the "extent to which test results are stable over time, consistent in terms of the content sampling, and free from bias" (Weir, 2005: 23). In this sense, scoring validity helps provide evidence to support both content validity and reliability.

The quality of test items can be assessed using item analysis, which also helps improve the teacher's ability to create test items. Item analysis can be used to evaluate whether an item is difficult or easy (Tracy, 2012). According to Polit and Yang (2015), item analysis is done to evaluate which items to discard, which to retain, and which need revision. Therefore, a test needs item analysis to evaluate its performance. Among the analyzed characteristics are item difficulty and discriminability.

Item difficulty is the percentage of people who answer an item correctly; it is the relative frequency with which examinees choose the correct response (Thorndike, Cunningham, Thorndike, & Hagen, 1991). It has an index ranging from a low of 0 to a high of +1.00, and higher difficulty indexes indicate easier items. An item answered correctly by 75% of the examinees has an item difficulty level of .75. The formula used to calculate the item difficulty, expressed as a percentage, is

p = [(PT + PB) / N] × 100

(Crocker & Algina, 1986)

in which:

PT = the number in the upper group who answered the item correctly
PB = the number in the lower group who answered the item correctly
N = the total number who tried the item

The higher the value, the easier the question.
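The difficulty formula above can be sketched as follows (the item counts are hypothetical, not drawn from the investigated test):

```python
# Sketch: item difficulty p = [(PT + PB) / N] * 100, where PT and PB are the
# numbers of correct answers in the upper and lower scoring groups.

def item_difficulty(correct_upper, correct_lower, n_total):
    """Return the difficulty index as a percentage (higher = easier item)."""
    return (correct_upper + correct_lower) / n_total * 100

# Hypothetical item: 18 of the upper group and 12 of the lower group answered
# correctly, out of 40 examinees who tried the item.
p = item_difficulty(18, 12, 40)
print(p)  # 75.0 -> a relatively easy item
```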

The table below shows the range of item difficulty used in the study, from Obon and Rey (2019):

Difficulty index   Interpretation
0.0-0.30           Very difficult
0.25-0.75          Average
0.76-1.00          Very easy

Discriminability

A test item discrimination index refers to the ability of an item to distinguish the test takers who have high grades from the test takers who have low grades. Following Kubiszyn & Borich (2003), the discrimination index measures the extent to which a test item discriminates or differentiates between students who do well on the overall test and those who do not do well on the overall test. The formula for item discriminability is given as follows:

D = (PT − PB) / n

in which n is the number of examinees in each group.

Trang 13

The discrimination index ranges from −1 to +1 (a negative value signals a flawed item). The greater the D index is, the better the discriminability.
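The discrimination formula can be sketched in the same way (again with hypothetical group counts):

```python
# Sketch: discrimination index D = (PT - PB) / n, where PT and PB are the
# numbers of correct answers in the upper and lower groups and n is the number
# of examinees in each group.

def discrimination_index(correct_upper, correct_lower, group_size):
    """Return D; higher values mean the item better separates high scorers
    from low scorers on the overall test."""
    return (correct_upper - correct_lower) / group_size

# Hypothetical item: in groups of 20 examinees each, 18 of the upper group and
# 6 of the lower group answered correctly.
d = discrimination_index(18, 6, 20)
print(d)  # 0.6 -> the item discriminates well
```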

In this study, the author intends to measure item difficulty and discriminability as statistical evidence for the content validity of the chosen test.

In summary, the current paper followed a combination of methods in assessing the content validity of the test. It is a process spanning before and after the test event. In the pre-test stage, the test content was judged by comparing it with the test specification. In the post-test stage, the test scores were analyzed for support of the content validity, by examining whether the content of specific items needs reviewing based on the analysis of item difficulty and item discrimination against the test specification.

2.3 Language testing in EOP

2.3.1 Language testing in EOP

Bachman (1990) defines language competence (language knowledge) as a component of language ability, besides strategic competence. Language knowledge includes two broad categories: organizational knowledge and pragmatic knowledge.

Douglas (2000) also proposes a model that distinguishes language knowledge, strategic competence and background knowledge. His approach is to adapt Bachman and Palmer's formulation of language knowledge with a modified formulation of strategic competence (Chapelle and Douglas, 1993). In this model, language knowledge consists of grammatical knowledge (knowledge of vocabulary, morphology, syntax and phonology), textual knowledge (knowledge of how to structure and organize language into larger units – rhetorical organization – and how to mark such organization – cohesion), and functional knowledge.

2.3.2 Testing reading in EOP

Reading competence is defined as "understanding, using, reflecting on and engaging with written texts, in order to achieve one's goals, to develop one's knowledge and potential, and to participate in society" (Organization for Economic Co-operation and Development, 2010). Under this definition, reading competence is essential for accessing knowledge. It is, therefore, a basic and essential competence, not only to function in an academic context but also to develop any personal, working or social activity (Zayas, 2012: 19).

Reading skills are considered "receptive skills", which leads to the "basic problem" of setting tasks that will not only cause the candidate to exercise reading skills, but will also result in behavior that demonstrates the successful use of those skills (Hughes, 2003).

In the same book, Hughes (2003) mentions some possible techniques for designing a reading task, including multiple choice, short answer, gap filling and information transfer.

It is clear that the reading passages in the researched test belong to the multiple choice and gap filling task types.

2.3.3 Testing grammar in EOP

Grammar may be roughly defined as the way a language manipulates and combines words in order to form longer units of meaning. There is a set of rules which govern how units of meaning may be constructed in any language; one may say that a learner who knows grammar is one who has mastered these rules and can apply them to express himself or herself in acceptable language forms (Chung and Pullum, 2015).

Hughes (2003) mentions four techniques for testing grammar: gap filling, paraphrase, completion, and multiple choice.

In the context of IU, the grammar part of the achievement test takes the form of the multiple choice technique, which means that each sentence has four options for students to choose from, but the instruction is "Identify the mistake (A, B, C or D) in each sentence and correct it." So, in fact, the type of question in the test is a combination of a multiple-choice error identification task and a gap-filling task, because the test takers are supposed to not only identify the mistake but also correct it.

2.3.4 Testing vocabulary in EOP

In ESP testing, Douglas (2000) also puts knowledge of vocabulary in the sub-category of grammatical knowledge, as presented in Table 13 about communicative language ability. Vocabulary testing will be discussed under the four following points: (i) dimensions of vocabulary assessment, (ii) components of vocabulary knowledge, (iii) testing techniques and (iv) characteristics of ESP vocabulary.

(i) Dimension of vocabulary assessment

According to Read (2000), there are three dimensions of vocabulary assessment: (i) discrete – embedded, (ii) selective – comprehensive, and (iii) context-independent – context-dependent.
