
An evaluation of the final achievement test of basic English for first year non-English major students at Bac Giang Teachers’ Training College


DOCUMENT INFORMATION

Basic information

Title: An evaluation of the final achievement test of basic English for first year non-English major students at Bac Giang Teachers’ Training College
Supervisor: Nguyen Thi Nhu Hoa, MEd
Institution: Hanoi University
Major: TESOL
Document type: Thesis
Year of publication: 2007
City: Hanoi
Format
Number of pages: 140
File size: 9.67 MB


Structure

  • 1.1 Background to the study
  • 1.2 Aims of the study
  • 1.3 Scope of the study
  • 1.4 Outline of the thesis
  • 2.1 Achievement tests
    • 2.1.1 Definition
    • 2.1.2 Kinds of achievement tests
      • 2.1.2.1 Progress achievement tests
      • 2.1.2.2 Final achievement tests
  • 2.2 Test Characteristics
    • 2.2.1 Test Reliability
    • 2.2.2 Test Practicality
    • 2.2.3 Test Discrimination
    • 2.2.4 Test Validity
      • 2.2.4.1 Face validity
      • 2.2.4.2 Content validity
  • 2.3 Test Development
    • 2.3.1 Design stage
    • 2.3.2 Operationalization
    • 2.3.3 Administration
  • 2.4 Review of previous studies
  • 2.5 Summary
  • 3.1 Research Questions
  • 3.2 Description of the subjects
    • 3.2.1 The teachers
    • 3.2.2 The students
  • 3.3 Data collection instruments
    • 3.3.1 Justification
    • 3.3.2 Description of the data collection instruments
      • 3.3.2.1 Questionnaires
      • 3.3.2.2 Document analysis
  • 3.4 Data collection procedure
  • 3.5 Summary
  • 4.1 Results
    • 4.1.1 Results from survey questionnaires
    • 4.1.2 Results from document analysis
  • 4.2 Discussion
    • 4.2.1 Major findings
    • 4.2.2 Discussions
  • 4.3 Summary
  • 5.1 Suggestions
    • 5.1.1 Suggestions on improvements of the current testing situation at BGTTC
    • 5.1.2 A proposed test blueprint for the English final Test 2 at BGTTC
  • 5.2 Conclusion
  • CHAPTER IV: RESULTS AND DISCUSSION
    • APPENDIX 10
    • Chart 4.1: Time allowance for the test
    • Chart 4.2: The relation between the content of the test
    • Chart 4.3: Language skills and elements mainly measured in the test
    • Chart 4.4: The test length
    • Chart 4.5: The familiarity of the testing techniques used in the test
    • Chart 4.6: Information about what is to be measured by the Grammar test
    • Chart 4.7: Information about what is to be measured by the Vocabulary test
    • Chart 4.8: Information about what is to be measured
    • Chart 4.9: Information about what is to be measured by the Writing test

Contents


Background to the study

Testing plays an integral role in the teaching and learning process, serving as a key tool to measure what learners achieve as they encounter new ideas and ways of organizing knowledge. It contributes to the success of teaching by providing a clear picture of learning outcomes and by guiding instructional decisions. Beyond estimation, tests push teaching and learning forward by delivering feedback on what students have learned and signaling the next steps in the learning journey. Through assessment, teachers gain deeper insights into individual students’ abilities, interests, attitudes, and needs, enabling more effective teaching strategies and stronger motivation.

Tests come in many types, each serving different purposes. Achievement tests are designed to measure students' mastery of a specific syllabus, such as end-of-term or end-of-course assessments. Consequently, the content of achievement tests should be drawn directly from the detailed course syllabus or course books to ensure alignment with what has been taught. Two essential validity considerations for a high-quality achievement test are content validity and face validity. Because the main purpose of achievement tests is to evaluate students’ learning outcomes and mastery of the material, careful alignment between the test content and the curriculum is critical for accurate measurement.

Effective assessment hinges on aligning test content with the objectives of teaching and learning, ensuring content validity by covering all areas to be assessed and enabling accurate feedback on students’ weaknesses in vocabulary, grammar, listening, and other skills, so that instruction can be adjusted and learners can focus their efforts. The connection among learning, teaching, and testing is captured by content validity, which defines the purposes of assessment and the aims of instruction. Beyond this alignment, the perceived validity of the test by test-takers and other stakeholders matters; when people review the test they should feel that it measures what it is supposed to measure, which is known as face validity.

Ensuring both face and content validity of an achievement test begins with developing a precise test specification for the content to be learned at the earliest stage of test development, so that the test’s purpose and the content it will cover are stated as clearly as possible. Reexamining the alignment between each test item and its specification is another key step to safeguard content validity. This study provides a detailed discussion of these two aspects of achievement testing, with the aim of assisting in evaluating the current language-testing situation at Bac Giang Teachers’ Training College (BGTTC).

Obviously, testing is an important part of every teaching and learning process. As a teacher of English, I am aware of the importance of testing. However, from my own experience and from the anecdotal evidence provided by both teachers and students at BGTTC, there are many problems with the English achievement tests for non-English major students. Some of them are not suitable for students' level, being either too difficult or too easy. Some of them do not test what has been taught, as they are taken from available commercial tests elsewhere that are not suitable for students at BGTTC. In addition to that, students usually complain about the scoring. They usually ask their teacher for keys and compare their tests with their friends’ after the examinations. However, the marks they get are either higher or lower than what they expected. Especially, the scores made by the second-semester students have recently been very low, with more than 75% of the students failing the exam. Therefore, test results may fail to measure accurately whatever they are intended to measure; such tests lack validity and reliability. As a result, students’ true abilities are not always reflected in the test scores that they obtain.

Despite feedback from students and teachers about poor test quality and scoring, no formal evaluation of English achievement tests has been conducted. This gap highlights the need for the study reported in this thesis. An evaluation of the English achievement tests should be carried out to determine whether the final achievement test of Basic English for first-year non-English-major students at BGTTC is valid, reliable, and fair, and to assess whether it truly measures students’ English proficiency.

Such an evaluation will help ensure that testing at BGTTC is of good quality, and the results of the study will help improve the testing situation at BGTTC.

Aims of the study

To evaluate whether the final achievement tests of General English for first-year non-English major students at BGTTC have face and content validity.

Therefore, the study aimed at answering the following questions:

1. Does the test development procedure for first-year non-English major students at BGTTC follow a rational approach that ensures appropriate content?

2. How are the final achievement tests of Basic English (Tests 2) perceived by both teachers and test-takers?

3. How closely does the content of Tests 2 match the test objectives?

Scope of the study

Within the framework of a minor thesis, the scope of this study is limited to evaluating the validity of the final achievement tests of Basic English. The research concentrates on determining whether these tests accurately measure the intended language competencies and reflect the course objectives. By focusing solely on validity, the study avoids exploring other assessment properties and provides targeted evidence to support test-based decisions in Basic English.

This study investigates the face validity and content validity of an assessment designed for first-year non-English major students at BGTTC. To achieve this aim, data were collected from diverse sources, including survey questionnaires and the analysis of Test 2.

Two survey questionnaires were administered to 14 teachers in the English Section and to 166 second-semester non-English major students at BGTTC to capture their opinions on Test 2 and on how the tests are constructed. In addition, two existing versions of Test 2 were analyzed to gather evidence of their content validity, using item-to-descriptor matching and subsequent statistical analysis of the results. In short, the study presents the collected and analyzed data on the currently used test and offers concrete suggestions for its improvement.

Due to time constraints, limited ability, and other conditions, the author cannot examine all qualities of the test, such as reliability, practicality, and other aspects of test validity, and cannot cover all ten used final tests. Only two of the used tests are examined, and a suggested test specification for Test 2 is presented.

Outline of the thesis

The thesis is organized into 5 chapters as follows:

Chapter 1 - Introduction - presents the background to the study, the objectives, the significance and the outline of the study.

Chapter 2 - Literature Review - presents a review of literature that provides the theoretical basis for evaluating the face and content validity of achievement tests.

This chapter defines achievement tests and outlines their types, then examines the characteristics of a good language test with a focus on face validity and content validity, followed by a discussion of the theoretical framework for test development and, finally, a critical review of related previous studies with comments on their data collection instruments.

Chapter 3 - Methodology - describes the data collection instruments of the study.

The study investigates how teachers design tests and gathers both teachers’ and students’ opinions on the content, format, and layout of the assessments through two questionnaires. To strengthen data reliability, two tests were randomly selected from a pool of ten and analyzed to confirm the information on content validity obtained from the questionnaires.

Chapter 4 - Results and Discussions - provides the results and findings of the study and discusses the findings.

Chapter 5 - Suggestions and Conclusion - makes some suggestions to improve the current testing situations at BGTTC and concludes the study.

This chapter provides an overview of the theoretical background for the study and is organized into three parts: the first part discusses the achievement test and its types; the second part offers a brief review of the major characteristics of a good test, with a focus on content validity and face validity; the third part presents the theoretical framework for test development; finally, a review of previous studies is provided.

Achievement tests

Definition

Achievement tests are defined in different ways by researchers. Brown (1994) conceptualizes an achievement test as being related directly to classroom lessons, a unit, or even an entire curriculum, and limited to the materials covered within a curriculum over a specific time frame. Put simply, an achievement test assesses a student’s language in relation to the curriculum they have studied in a given course. It is a measurement tool designed to gauge students’ progress. Through an achievement test, both teachers and students can track progress in teaching and learning: the teacher evaluates how effectively they have taught, while the student sees how well they have learned and can diagnose the areas that were not well taught or not well learned.

Sharing the same opinion, Hughes (1989) points out that the purpose of an achievement test is to assess “how successful individual students, groups of students, or the courses themselves” have been in achieving the stated objectives. From this perspective, the achievement test is designed to determine how much of the course content each student has learned and to verify that that content is actually taught within the curriculum. In other words, the test measures whether the learning outcomes outlined by the course are reflected in what has been taught and learned.

Related to the relationship between achievement tests and other kinds of tests, Harrison (1983, p.7) draws a clear distinction between achievement tests and diagnostic tests: an achievement test looks back over a long period of learning, while a diagnostic test is used during instruction. Achievement tests are typically given at the end of a course to assess what students have learned across the whole course or even across several courses, with the measurement focused on the course objectives. Unlike achievement tests, diagnostic tests are often needed during the course; they may be administered at the end of a unit or after a lesson to identify and remedy errors related to persistent learning difficulties. Consequently, diagnostic tests measure the most common learning errors.

Achievement tests play a crucial role in the educational program by measuring students' acquired language knowledge and skills during the course of study. They help teachers and students reflect on what the syllabus has covered and plan for future teaching and learning.

Kinds o f achievement tests

Achievement tests can be subdivided into different kinds depending on researchers’ points of view. For example, Gronlund (1985) divides achievement tests into two kinds: standardized achievement tests and criterion-referenced tests.

Achievement tests come in two main forms: standardized achievement tests (usually norm-referenced) and criterion-referenced achievement tests. Standardized tests measure how well students achieve common objectives across many schools and are developed by test specialists, retested, and selected based on their effectiveness and relevance to rigid specifications, which yields high-quality test items. By contrast, criterion-referenced achievement tests assess a student’s mastery of instructional objectives and are interpreted by the extent to which a limited set of specified educational tasks has been mastered.

Within classroom assessment, which is typically designed by teachers, achievement tests are categorized into two main types: progress achievement tests, which monitor ongoing learning, and final achievement tests, which assess overall mastery at the end of a course. These categories are explained in the following sections.

Progress achievement tests are designed to measure the progress that students are making. Typically, these assessments are teacher-made, prepared by class teachers who know their students well and are familiar with the curriculum and instructional program they have followed, enabling them to gauge how much has been taught and how much has been learned. Consequently, teachers can identify students' strengths and weaknesses and diagnose the areas that have not been adequately achieved during the course of study.

In the same vein, Heaton (1988, p.163) also emphasizes that the progress test is based on the language program that the class has been following, and that it is as important as evaluating the teacher’s own work and the student’s learning. By taking part in this kind of test, students have a strong opportunity to perform the target language in a positive and effective manner and to gain additional learning benefits, confidence, and evidence of progress that supports both teaching quality and student development.

Building confidence in handling test items is a key outcome, and teachers support this by helping students become familiar with the formats used in the tests and by teaching the strategies needed to tackle them. When learners understand what to expect and how to approach each question, they gain assurance and perform more effectively. This preparatory, supportive work is widely regarded as a valuable step toward success on the final achievement tests.

Final achievement tests, like progress tests in some respects, are closely tied to language courses. They measure how well the course's objectives are achieved by individuals, groups of students, or the course as a whole.

According to Heaton (1988, p.163), final achievement tests differ from other assessments in that they are more formal and are intended to measure achievement on a larger scale. They are usually given to students at the end of the school year or at the end of a course of study to measure how far students have achieved the teaching goals. These tests may be written and administered by ministries of education, official examining boards, or by members of teaching institutions, based on detailed test specifications that cover the whole teaching content and objectives.

According to Hughes (1989, p.11), the content of a final achievement test should be based directly on a detailed course syllabus or on the course objectives. He proposes two design approaches for achievement tests: the syllabus-content approach and the objective-content approach. In the syllabus-content approach, a final achievement test is designed to cover only what students have actually learned, by referencing the detailed course syllabus or the books and other materials used in the course of study. This alignment helps ensure the test is fair because it mirrors the curriculum and instructional emphasis. However, besides that strength, this approach can be limited by its focus on documented content, potentially neglecting broader competencies or skills not explicitly listed.

An approach that relies on the syllabus contents can cause problems if the syllabus is poorly designed or the textbooks and materials are unsuitable, introducing content that diverges from the course objectives and leading to misleading results that do not reflect students’ true achievement of those objectives. This drawback of syllabus-content alignment can be avoided with an objective-content approach, where the content of the final achievement test is based directly on the course objectives. Since course objectives are typically reflected in the syllabus and the course books, a test designer who ensures clear alignment with these objectives can include them in test construction, resulting in assessments that more accurately indicate students’ attainment in the course of study.

Test Characteristics

Test Reliability

Reliability of a test refers to its consistency; a test is considered reliable if its results are the same or nearly the same across administrations. Consistency, then, concerns how a test performs across time, across different forms, and across its items, which correspond to three types of reliability: test-retest reliability, parallel-form reliability, and internal consistency. Test-retest reliability measures the stability of scores when the same group of students takes the same test on separate occasions. Parallel-form reliability assesses the agreement of scores between two equivalent versions of the test. Internal consistency evaluates how well the items on a single administration measure the same underlying construct. Together, these forms of reliability describe how dependable a test is in producing stable results.

Test reliability comes in several forms. Test-retest reliability checks stability over time by administering the same test to the same group on separate occasions, with the Pearson product-moment correlation coefficient (r) used to gauge consistency; a coefficient around .80 indicates a strong, acceptable level of reliability across administrations. Reliability across forms examines parallel tests that share the same number of items, item types, difficulty levels, and content areas but differ in actual content; when given to the same group, these forms should yield similar results, with r around .85 considered acceptable. Finally, reliability across test items, or internal consistency, asks how consistently the items measure the intended construct in a single administration. The common methods here are split-half reliability and Cronbach’s Alpha, where the test is divided into two parts (or an overall alpha is calculated across all items) and the resulting correlations are used to estimate the test’s internal consistency.
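As an illustration only (not taken from the thesis), the sketch below computes a test-retest coefficient and Cronbach’s Alpha for small, invented score sets; Python with numpy is assumed, and all data and names are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation between two score arrays,
    e.g. the same students' scores on two administrations (test-retest)."""
    return np.corrcoef(x, y)[0, 1]

def cronbach_alpha(items):
    """Cronbach's Alpha for an (n_students x n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 6 students, two administrations of the same test
first = np.array([72, 65, 88, 54, 91, 78])
second = np.array([70, 68, 85, 58, 93, 75])
print(f"test-retest r = {pearson_r(first, second):.2f}")

# Hypothetical data: 6 students x 4 items, each scored 0-5
scores = np.array([[4, 5, 4, 5],
                   [2, 3, 2, 3],
                   [5, 5, 4, 5],
                   [1, 2, 2, 1],
                   [3, 4, 3, 4],
                   [2, 2, 3, 2]])
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```

With real data, coefficients at or above the .80 and .85 thresholds mentioned above would be read as acceptable stability and internal consistency.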

Bachman (1990) identifies three factors that can influence the reliability of a test. The first is test-method facets, which include the environment, scoring rubrics, input, the expected response, and the relationship between input and the expected outcome. The second factor is personal attributes, such as age, gender, cognitive style, and background. The third factor comprises random factors, such as tiredness, emotional state, and random variations in the testing environment.

According to Heaton (1988), several factors influence the reliability of a test, described in simpler terms. These factors include how the test is administered, the timing and duration of the test, the testing conditions, how the test is observed or controlled during administration, the length of the test, and the clarity of the test instructions and scoring methods.

Reliability is a fundamental quality of any good test; when results are unreliable, the assessment derived from them is unreliable as well. To ensure a test is reliable, test developers should focus on three core aspects of reliability that matter most for sound measurement: test-retest reliability, which reflects the stability of scores over time; inter-rater reliability, which measures consistency across different scorers; and internal consistency, which indicates how well the test items hang together to measure the same construct. By attending to these facets, researchers can produce tests that yield stable, interpretable results and support trustworthy conclusions in any evaluation.

For Harrison, what matters to the teacher and students are ‘the circumstances in which the test is taken, the way in which it is marked and the uniformity of assessment it makes’ (Harrison, 1983, p.10).

Test Practicality

Test practicality refers to the practical factors that test developers must consider, including financial limitations, time constraints, ease of administration, and scoring and interpretation. Bachman and Palmer (1996, p.350) state that practicality pertains to how a test will be implemented and, to a large degree, whether it will be developed and used at all. They identify three types of resources essential to estimating the monetary costs for a test: human resources (such as test writers, scorers or raters, test administrators, and clerical and technical support staff); material resources (space for development and administration, and equipment like typewriters, cassette players, overhead projectors, tape and video recorders, and computers); and test materials (test papers, answer sheets, pencils). Time is another key dimension, encompassing total development time as well as time for specific tasks such as design, writing, administration, scoring, and analysis.

Test Discrimination

Discrimination in testing refers to the capacity of an assessment to differentiate among different students and to reflect real differences in performance within a group (Heaton, 1988, p.165). He argues that a 70% score means nothing unless the other scores from the test are known. Furthermore, tests in which almost all candidates score around 70% clearly fail to discriminate between the various students. If a test is too easy or too difficult, it will not reveal meaningful discrimination among examinees. Obviously, discrimination is needed with placement tests, but it may not be needed with tests concerned with how much of a syllabus students have mastered, such as achievement tests.
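Heaton’s point can be made concrete with the classic upper/lower-group discrimination index, D = p(upper) − p(lower), the difference between the proportions of high and low scorers who answer an item correctly. The sketch below is a standard illustration with invented data, not a procedure reported in the thesis.

```python
def discrimination_index(item_correct, totals, fraction=1/3):
    """Upper/lower-group discrimination index for a single item:
    D = p_upper - p_lower, where p_* is the proportion answering the
    item correctly in the top/bottom scoring groups."""
    # Rank students by total test score, highest first
    ranked = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    n = max(1, int(len(totals) * fraction))
    upper, lower = ranked[:n], ranked[-n:]
    p_upper = sum(item_correct[i] for i in upper) / n
    p_lower = sum(item_correct[i] for i in lower) / n
    return p_upper - p_lower

# Hypothetical: 9 students' total scores, and whether each got item 7 right
totals = [95, 88, 82, 75, 70, 66, 60, 52, 40]
item7 = [1, 1, 1, 1, 0, 1, 0, 0, 0]   # 1 = correct, 0 = incorrect
print(discrimination_index(item7, totals))  # 1.0: item separates strong/weak
```

An item with D near 1.0 separates strong from weak candidates well; an item that everyone answers correctly (or incorrectly) yields D near 0 and, as Heaton notes, tells us little about individual differences.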

Test Validity

Validity is arguably the most important criterion for the characteristics of a good language test, and researchers in language testing have approached it from multiple perspectives. Bachman (1990, p.237) defines validity as the degree to which evidence supports the inferences drawn from test scores, with an emphasis on interpreting scores rather than evaluating the test itself. Cronbach (1971) likewise asserts that establishing test validity involves collecting evidence to support the kinds of inferences that can be appropriately drawn from test scores. Together, these views highlight that validity centers on the soundness of score interpretations, rooted in evidence, rather than on the test format alone.

Validity is conceived as the fidelity with which a test measures what it purports to measure, a view emphasized by Garret (1937, p.324). This perspective is echoed by Hughes (1989, p.22), who notes that a test is valid when it measures accurately what it is intended to measure, and by Moore (1992, p.67), who defines validity as the degree to which a test measures what it is supposed to measure. Anastasi (1988, p.139) likewise highlights that validity concerns both what the test measures and how well it does so. A frequently cited definition, especially for class tests, was presented by Henning (1987) in A Guide to Language Testing.

Validity refers to how appropriate a test or any of its parts is for measuring what it claims to measure. A test is considered valid to the extent that it accurately assesses the intended construct, rather than something else. The term ‘valid’, when used to describe a test, should usually be accompanied by the preposition ‘for’: any test may be valid for some purposes, but not for others (Henning, 1987, p.89).

Test validity hinges on how well the test content matches its intended focus. A high-validity test closely reflects the content areas and competencies it is meant to measure; when validity is poor, the test fails to capture the targeted skills. In practice, tests are valid or invalid only with respect to their intended use. For example, a test designed to assess reading ability that also measures writing may not be valid for testing reading alone, but it can be valid for the broader purpose of evaluating both reading and writing together.

Validity refers to the appropriateness or correctness of inferences, decisions, or descriptions drawn from test results about individuals, groups, or institutions. When evaluating a test’s characteristics, validity should be considered from the multiple perspectives proposed by researchers to capture its various dimensions. In addition, validity must be assessed in terms of the accuracy of a specific inference about the test taker.

Validity is divided into different types such as face validity, content validity, construct validity, concurrent validity, predictive validity, etc. Each type corresponds to a different research purpose. However, according to Harrison (1983, p.11), the first two types are ‘vital for the teacher setting his own test’. Therefore, this study will focus on these two types of validity: face validity and content validity.

According to Heaton (1988, p.153), ‘if a test item looks right to other testers, teachers, moderators, and testees, it can be described as having at least face validity’. That is, face validity is concerned with what these people think of the test. For example, they expect the question paper to be clear, error-free, written in plain, fairly formal language, and of a suitable level of difficulty.

Face validity is commonly defined by Cohen (1994, p.42) as the degree to which a test appears to measure what it is supposed to measure. It does not need to be judged solely by content experts; any reviewer can offer useful information about the test’s apparent appropriateness. Consider a scenario where examinees are expected to perform at their best, yet some test items seem incongruent with the assessment’s aims. In such cases, a test-taker may refrain from fully disclosing their ability, leading to lower scores and a potential compromise of the test’s validity. Cohen (1994, pp.42-43) further clarifies face validity by articulating three criteria for evaluating it from the perspective of students:

1. Their perceptions of any bias in test content (i.e., whether they perceive the content to favor a respondent with certain background knowledge or expertise);

2. Their understanding of the nature of the task that they are being requested to complete;

3. Their awareness of the nature of their performance of the test as a whole and on any particular subtests (for example, the test-taking strategies that they employed) (pp.42-43).

Following this view, Brown (1994a, p.385) notes that face validity rests on students' perception that a test appears valid, and he outlines practical principles that test writers can apply to assess and strengthen the face validity of their assessments:

• a carefully constructed, well-thought-out format;

• items that are clear and uncomplicated;

• directions that are crystal clear;

• tasks that are familiar and that relate to their course work;

• a difficulty level that is appropriate for your students;

• test conditions that are ‘biased for best’, that bring out students’ best performance.

The value of face validity has been much more emphasized with the advent of Communicative Language Testing (CLT). Many researchers, including Morrow (1979, 1986) and Carroll (1980, 1985), who support CLT, share the view that a CLT-based assessment should resemble real-world language use, something that could function in the real world with language, and they attribute this appeal to real life to face validity.

Alderson, Clapham, and Wall (1995, p.172) recognize face validity as an influential factor in testing. Although students are not experts, their opinions about the test’s appearance (the face validity or ‘look’ of the test) are important because this is the kind of response one gets from test-takers. If a test does not appear valid to test-takers, they may not perform their best, making the perceptions of non-experts useful in this context. The authors emphasize that the only way to uncover this quality is by interviewing or surveying teachers’ and students’ attitudes or feelings about the test they have just taken or examined.

From the definitions and opinions presented above, the face validity of a test refers to what the test appears to measure on the surface rather than what it actually measures; thus, face validity rests on the perceptions of laypeople, such as administrators, non-expert users, and students, of the test’s instructions, content, item format and layout (Cohen, 1994).

2.2.4.1.2 Major considerations in measuring face validity of a test

To enhance the face validity of a test, its construction should carefully address the test instructions, test content, item format, and overall layout. When any of these factors is poorly prepared, such as unclear directions, inadequate time limits, or poorly designed test items, the items or assessment tasks may not function as intended, which can lower the test's face validity.

The instructions of a test, according to Gronlund (2003), should include six points:

The purpose of the test should be clearly stated in the test instructions to align expectations with learning objectives. This purpose can be introduced at the start of the semester or course, or at the time the test is announced, and it should be reiterated at the time of testing, whether orally or in writing, especially when the exam covers content from several sections taught by different teachers.

Exam instructions should specify the time allotted for each section and for the test as a whole. When students know the duration for every part and the entire exam, they can allocate their time across sections in a reasonable and effective way. This structured time management helps prevent spending too much time on difficult questions, ensuring a smoother, more balanced pace throughout the test.

Test Development

Design stage

Figure 1 shows that the initial stage in test development is to generate a list of design statements that provide essential background information about the planned examination program. This information is then used to focus and guide the remaining stages of the test development process. Six components included in this stage are described below.

2.3.1.1 Description of the purpose of the test

Test design begins with clarifying the test’s purpose and intended uses, because that purpose dictates the time and resources invested. By describing the test’s purpose, developers identify the specific inferences about language ability that will be drawn from results and the decisions those inferences will support; Bachman and Palmer (1996) describe this as clearly stating the intended inferences and the decisions that flow from test results. The main decision types include grading, diagnostics, selection, and placement, and they most commonly affect test takers, teachers, and programs. For students, grading decisions are typically the most influential, while decisions about teachers and programs are guided by inferred language ability and used to assess instructional effectiveness, inform salaries and promotions, and drive program changes. An example is the final achievement test for first-year students at BGTTC, designed to measure mastery of specific lexical and grammatical forms.

This assessment performs a specific function to measure students’ reading comprehension ability, and its results are used to make decisions about students’ progress and grades, reflecting the degree to which they meet the course objectives; decisions about teachers or programs are not involved.

2.3.1.2 Description of the TLU domain and task types

Target Language Use (TLU) domain refers to a set of authentic, real-world language tasks that test-takers are likely to encounter outside the testing environment, and these tasks are used to ensure that our inferences about language ability generalize to actual communicative performance beyond the exam (Bachman & Palmer, 1996, p.44).

To develop real test tasks, we need detailed descriptions of the various TLU task types. This involves identifying and describing test tasks in the TLU domain so that explicit, comprehensive task descriptions can serve as a reference for test developers when designing tasks whose characteristics reflect the features of TLU tasks. In achievement tests at BGTTC, TLU tasks are typically based on the course syllabus or the textbook.

A list of TLU tasks may look like this: “Rewriting the sentences using the passive voice, completing sentences or a paragraph with suitable words, choosing the suitable tense for the verbs, etc.”

2.3.1.3 Description of the characteristics of the test takers

Describing the characteristics of test takers guides later stages of test development by identifying which traits may influence performance and must be considered when defining the features of the test tasks. In other words, a detailed analysis of test-taker profiles informs item design, scoring approaches, and the overall test structure to ensure the assessment measures the intended constructs across diverse groups. This approach enhances validity and fairness by aligning test tasks with how different test takers approach items.

Developers should integrate into their plan a detailed profile of test takers, covering personal characteristics, topical knowledge, overall proficiency level and language ability profile, and likely effective response strategies for test tasks. This information enables tailored item design and scoring criteria, improves assessment validity, and supports reliable measurement across diverse learner groups.

Alderson et al. (1995, p.38) emphasize that describing test-taker characteristics is central to effective test design, and they insist that ‘above all, specifications writers must first decide who their audience is, and provide the appropriate information.’ This audience-centered principle guides the creation of clear, relevant, and actionable test specifications by identifying who will use the document (test developers, educators, administrators, and test-takers) and what information they require. By outlining demographic and contextual profiles, skill levels, and testing conditions, designers can tailor content, format, scoring, and reporting to align with user needs and practical assessment goals.

Crucial information of this kind is too often overlooked in BGTTC’s test development process, reflecting a broader lack of diligence among many test developers. This omission can influence how test tasks are selected and may subsequently impact students’ performance on the exam.

2.3.1.4 Definition of the construct(s) to be measured

Defining the construct to be measured is a critical step in the test design process, because it explicitly specifies the exact nature of the ability under assessment in abstract terms. The construct can be understood as the capacity to use language to perform a given task, for example, recognizing specific details in a reading passage or understanding vocabulary and its meaning. Consequently, designers must decide which abilities to include or exclude from the construct definition, with the decision guided by the inferences that will be drawn from the test results.

2.3.1.5 Plan for evaluating the qualities of test usefulness

Because usefulness is an essential consideration in all stages of test development, developing a formal plan to assess a test’s usefulness is essential to keep it consistently high across its six qualities. According to Bachman and Palmer (1996), the plan consists of three parts: an initial step that considers the right balance among the six usefulness qualities and establishes minimum acceptable levels for each; a logical evaluation phase that employs a set of questions and a checklist to review the design statement, the test blueprint, and the test tasks; and a final procedure for collecting qualitative and quantitative evidence during the administration stage.

Balancing the six qualities of usefulness starts by setting minimum acceptable levels for each quality and recognizing that the appropriate balance and thresholds are context-dependent. A logical usefulness assessment can be conducted by following a structured checklist for evaluating usefulness and by collecting data on performance across the qualities. This evaluation plan is flexible and can be modified at different stages, and additional methods for testing usefulness are encouraged to expand insight and reliability.

2.3.1.6 Identification of resources and development of a plan for their allocation and management

Effective resource identification and allocation are fundamental in test development because they determine the project’s feasibility. In this phase, the required resources (human, material, and time) needed for each activity and the resources available are identified, and a plan is created to allocate and manage them throughout the development process to ensure smooth progress and timely delivery.

Human resources, which refer to test developers, test writers, scorers, test administrators and clerical support, are the individuals involved in the test development process.

In test development, these individuals perform different roles and functions, though in classroom testing the tasks can be carried out by a single person, the teacher, who writes, administers, analyzes results, and archives the materials. Key material resources include space, equipment, and test materials, while time is a critical factor divided into development time and the time required to complete each stage of the development process.

Ultimately, the test content is defined by the design statement produced in this stage, and the outcomes from this stage are then integrated into the operationalization stage, where they guide the development of the test items.

Operationalization

During the design stage, the components of the design statement establish the basis for developing test tasks, formulating test task specifications, and producing a comprehensive blueprint for the test as a whole. Consequently, the operationalization stage focuses on two interrelated activities described below, aligning task design with measurement objectives and ensuring that the test is executable, reliable, and coherent with the overall testing framework.

A test task specification is a specialized form of technical writing used in developing a set of test items. It provides clear directions for preparing each item, detailing eligible item formats, the types of directions, limits for the stem, and the characteristics of the response options, as well as the features that define the correct answer and the distractors.

In fact, Bachman and Palmer (1996) stipulate that the firmest basis for a good set of test task specifications should include fully seven elements, including the following (a hypothetical sketch of such a specification follows the list):

+ The purpose of the task

+ The definition of the construct to be measured

+ The characteristics of the setting of the test tasks

+ Instructions for responding to the tasks

+ Characteristics of the input, the response, and the relationship between input and response
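As a sketch only, here is what such a specification might look like for one BGTTC-style task type; the task, wording, and field names are invented for illustration and are not taken from Bachman and Palmer or from the thesis.

```python
# Hypothetical task specification for one task type in a final test,
# organized around the Bachman and Palmer elements listed above.
task_spec = {
    "task_type": "Sentence rewriting (passive voice)",
    "purpose": "Infer mastery of the passive forms taught in the course",
    "construct": "Ability to transform active sentences into the passive "
                 "while keeping tense and meaning",
    "setting": "Written exam, pen and paper, 10 minutes, no dictionary",
    "instructions": "Rewrite each sentence in the passive voice. "
                    "Do not change the tense.",
    "input": "Five active sentences using vocabulary from the course book",
    "expected_response": "One grammatical passive sentence per item",
    "input_response_link": "Each response restates exactly one input sentence",
}

# Print the specification as a simple, readable reference sheet
for field, value in task_spec.items():
    print(f"{field:>19}: {value}")
```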

Using the design-stage outputs (characteristics of TLU task types, the test’s intended purpose, the construct to be measured, and the available development resources), the precise characteristics of the test tasks are established. This is accomplished either by adapting TLU task types as test tasks that satisfy usefulness criteria or by developing original test task types whose features align with TLU task characteristics, and by defining the specific purpose and construct definition for each task type.

Bachman and Palmer define a test blueprint as a detailed plan that provides the basis for developing an entire assessment. The blueprint is created to clarify the test developers’ intentions, enable the development of additional tests or parallel forms with the same characteristics, and allow evaluation of how closely the final test matches its blueprint and its overall authenticity. Within their test-development framework, the blueprint has two interrelated parts: a test structure detailing the number, salience, sequence, and relative importance of the parts and the number of tasks per part; and the task specifications that outline the requirements for each task type in alignment with the test structure.

A test blueprint can be conveniently presented as a table, often called a table of test specifications. Most blueprints list the major content areas to be covered and the cognitive levels intended for each test form. A typical blueprint appears as a two-way chart, with content areas in the rows and cognitive processes in the columns; the total number of items for each cognitive level is given for the overall test, and the number of items per row indicates the proportional emphasis of each content area (Kubiszyn and Borich, 2003).
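To make the two-way chart concrete, here is a minimal sketch of a blueprint with row and column totals; the content areas, cognitive levels, and item counts are invented for illustration only.

```python
# Hypothetical table of test specifications: rows = content areas,
# columns = cognitive levels, cells = number of items.
blueprint = {
    "Verb tenses":        {"Knowledge": 4, "Comprehension": 3, "Application": 3},
    "Passive voice":      {"Knowledge": 2, "Comprehension": 2, "Application": 2},
    "Topic vocabulary":   {"Knowledge": 5, "Comprehension": 3, "Application": 2},
    "Reading for detail": {"Knowledge": 0, "Comprehension": 4, "Application": 2},
}

levels = ["Knowledge", "Comprehension", "Application"]
total = sum(sum(row.values()) for row in blueprint.values())

# Row totals show the relative emphasis given to each content area.
for area, row in blueprint.items():
    n = sum(row.values())
    print(f"{area:<20} {n:>2} items ({n / total:.0%} of the test)")

# Column totals show the overall weight of each cognitive level.
for level in levels:
    n = sum(row[level] for row in blueprint.values())
    print(f"{level:<20} {n:>2} items")
```

The row totals correspond to the proportional emphasis of each content area that Kubiszyn and Borich describe, while the column totals show how heavily each cognitive level is sampled across the whole form.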

Two core elements shape a test blueprint: the list of content areas and the levels of cognitive processing. These components define the content domain and test objectives, and they drive how the assessment samples content to ensure adequate representation of the domain.

Content area defines the topics to be covered in a course and the objectives for which test items will be written. Scholars such as Linn and Miller (1995), Osterlind (1989), and Kubiszyn and Borich (2003) agree that a content-area outline links course topics to the testing goals. A test objective is a clear, concise statement of the skills students are expected to perform after instruction. In other words, test objectives reflect the intended learning outcomes of the course and any special conditions under which learning will take place. If the observable learning outcomes are to occur at a specific time, in a particular place, or with certain materials, equipment, or resources, those conditions must be stated explicitly in the objective.

Content validity is achieved when your assessment items require precisely the same learning outcomes and contexts that your instructional objectives specify. In other words, design questions and tasks so that the knowledge, skills, and situations learners were expected to master are the ones actually being tested. By ensuring this alignment between learning objectives, test items, and the learning conditions, you create assessments that truly measure what students were intended to learn, strengthening the overall validity of your evaluation.

The cognitive domain is one of the three areas in the taxonomy of educational objectives and provides a framework for developing a comprehensive set of instructional objectives. It concerns the kinds of performances students are expected to demonstrate, such as knows, comprehends, and applies, as noted by Wiersma. By defining these cognitive objectives, educators can align instruction and assessment to promote higher-order thinking and clearly specify learning outcomes.

Bloom’s taxonomy is the most widely employed framework for labeling and articulating levels of cognitive processing in test construction, and it is especially useful for identifying the learning outcomes to consider when developing a comprehensive list of objectives for a unit or course. In the cognitive domain, the categories include Knowledge (remembering), Comprehension (understanding), and Application (applying previously learned information in new contexts).

"Analysis’- at which objectives require the student to identify logical errors or to differentiate among facts, opinions, assumptions, hypotheses, or conclusions;

Synthesis requires learners to produce something original by combining ideas in new ways, while Evaluation asks students to judge the value or worth of methods, ideas, people, or products for a specific purpose. Together, the six levels (Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation) form a ladder of increasingly complex cognitive skills, which Linn and Miller (1995) describe as ranging from the simplest to the most complex.

Administration

At this stage, the aim is to evaluate the test’s usefulness and the inferences or decisions it is designed to support. This is accomplished by administering the test to a sample of individuals, collecting their responses, and analyzing the resulting data to inform conclusions and practical decisions.

Bachman and Palmer (1996) describe four essential steps in test administration: first, preparing the testing environment to match the test blueprint specifications, covering location, materials and equipment, personnel, timing, and the physical conditions under which the test is given; second, delivering instructions that are clear and understandable for all test-takers; third, maintaining a supportive, distraction-free testing atmosphere throughout the session by controlling factors like temperature, noise, and movement; and fourth, collecting the tests with an opportunity for test-takers to provide feedback or discuss their experience with proctors. Together, these steps help ensure the test is administered effectively and its usefulness is enhanced.

In summary, a test development process should be organized into three stages. In the design stage, essential background information is articulated to lay the groundwork for subsequent test development, while operationalization concentrates on creating task specifications for each type of test item and a blueprint that shows how tasks are sequenced to form complete tests. These initial steps function as a specification for the entire assessment, detailing what the test intends to measure and how it will be measured. They also help ensure that every item targets a specific skill, concept, or area of knowledge and is measured in the same way and at the same level of difficulty across administrations, so that the results best reflect students’ language competence. Accordingly, the test’s validity, especially its face and content validity, depends on the extent to which the specifications and test content align with the language skills and structures outlined in the blueprint. The final administration stage includes activities designed to create an appropriate psychological atmosphere, which supports the reliability of the test results.

From all the analysis above, it can be concluded that if a thorough test development procedure is followed, the validity and reliability of a test are likely to be very high.

Description of the subjects

Data collection instruments

Results

Discussion

Suggestions

RESULTS AND DISCUSSION
