Curriculum Innovation, Testing and Evaluation: Proceedings of the 1st Annual JALT Pan-SIG Conference.
May 11-12, 2002. Kyoto, Japan: Kyoto Institute of Technology.

Improving test performance through a language test evaluation cycle

by Richard Blight (Ehime University)



How can teachers assess the effectiveness of a language test in a specific context? How can the performance of a language test be improved to more effectively achieve specific program goals? Teachers rely on testing instruments as a matter of routine classroom practice. However, there are inherent difficulties in the process of measuring learning progress, which directly affect attempts to assess the effectiveness of language testing instruments (McNamara, 1996, p. 2). Research is consequently required into effective means of improving language test performance.
The concept of an evaluation cycle has previously been applied to developing language learning tasks (Breen, 1989), and it can usefully be extended to the development of language tests. The language test evaluation cycle discussed in this paper is concerned with developing testing instruments to better meet the specific needs of local learning contexts. Essential components of Breen's model which can be directly adapted include the investigation of test administrations in the classroom and the measurement of learner performance against specified test criteria (1989, p. 193). Test specifications play an important role in this process since they "force explicitness about the design decisions in the test and . . . allow new revisions to be written in the future" (McNamara, 2000, p. 31). The evaluation process hence begins by reviewing the relationship between test specifications and the specific objectives of a language program.

Test specifications and program objectives

Test specifications, including formal statements of performance criteria, represent decisions made during the test design process concerning how to effectively operationalize theoretical constructs (Bachman & Palmer, 1996, p. 87). It is consequently important to recognise the significant role that the specifications also play in the evaluation process, as discussed by McNamara:
even though . . . tests in general performance assessment typically do not make explicit reference to a theory of the underlying knowledge and ability displayed in performance, a theoretical position is implicit in the criteria by which raters are to make judgements. (1996, p. 19)

The specifications should not, however, be regarded as being finally determined in the test construction stage. Rather, they should be reviewed (with other aspects of the test) in terms of evaluative feedback on the usefulness of the testing instrument within a specific context of use (Bachman & Palmer, 1996, p. 87). Significant improvements can subsequently be made to tests "in the light of their performance and of research and feedback" (Alderson, Clapham, & Wall, 1995, p. 218). A beneficial evaluation cycle would consequently involve stages of analysing test performance in the learning context, devising appropriate revisions, and evaluating the revised test (e.g., see Cohen, 1994, pp. 101-112). Few teachers are, however, able to undertake complex test evaluation procedures during the course of their regular teaching programs.

[ p. 125 ]

In this paper, a communicative language test is evaluated in order to explore the types of issues that may be encountered in the evaluation process. Test specifications are reviewed against design principles and communicative language teaching goals. Professional judgements are made concerning the value and purpose of various aspects of the test, with a view to developing an improved testing instrument. The test is revised in order to address problem areas in the test performance. It is hoped that teachers can apply similar evaluation cycles to specific learning contexts in order to improve the performance of testing instruments.

The curriculum framework

The test used in this study was developed in an English language teaching institution in Australia and administered to adult migrant students. The institution is an authorized provider of the Adult Migrant Education Program (AMEP), a comprehensive national program developed and administered by the Australian government to provide English instruction to immigrants. Associated with the language teaching program is a series of nationally recognised certificates (Certificate in Spoken and Written English, or CSWE). Each level of the CSWE requires achievement of sets of discrete competencies. As part of the national curriculum, detailed specifications (Adult Migrant Education Service, 1995) are provided for each competency. The specifications are divided into a number of content areas: Elements (essential linguistic features, knowledge relevant to the content, and context requirements), Performance Criteria (statements about the learner's performance in the language interaction), Range Statements (conditions or parameters associated with the assessment task), Evidence Guides (suggestions for tasks which could be used to assess the competency), Benchmark Performances (samples of assessed learner work, accompanied by specific grading information at various levels), and the Moderation process (moderation sessions in which assessors develop expertise in making assessment determinations).
Assessment in the CSWE system is criterion-referenced, whereby "individual performances are evaluated against a verbal description of a satisfactory performance at a given level" (McNamara, 2000, p. 64). In contrast to a normative system in which numerical scores are allocated to test results, students' work is instead measured against the performance criteria provided for each competency. Students are assessed in terms of whether or not their work demonstrates the performance criteria at an appropriate standard. If their work successfully demonstrates all the performance criteria, they achieve the competency and progress to studying another competency on the certificate. Alternatively, they continue working to achieve the same competency in future classes. The CSWE framework incorporates a number of design features aimed at establishing validity and reliability in tests. Performance ratings are standardized based on samples of performance benchmarks (provided as part of the curriculum framework) for each competency, combined with mandatory teacher training in the moderation process (also as part of the curriculum framework). The CSWE assessment procedure hence involves comparison of student performances against performance benchmarks which illustrate appropriate standards for the performance criteria.
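As a concrete illustration of this decision rule (a minimal sketch rather than a description of the CSWE procedure itself), the short Python example below records a rater's success / fail judgements and awards the competency only when every criterion has been demonstrated; the criterion names and ratings shown are invented for the example.

    def competency_achieved(ratings):
        # A learner achieves the competency only if every performance
        # criterion has been judged as demonstrated; no numerical score
        # is produced in a criterion-referenced system of this kind.
        return all(ratings.values())

    # Invented ratings for illustration only
    ratings = {
        "follows conventions of layout for formal letter": True,
        "stages text appropriately": True,
        "uses appropriate vocabulary to reflect the topic": False,
    }
    print(competency_achieved(ratings))  # False: the learner reattempts the competency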

[ p. 126 ]

The test evaluation process

Students on the AMEP are generally motivated to achieve CSWE certificates for purposes of future employment and further study. A primary teaching objective is consequently to assist students to gain their certificates by achieving the required sets of competencies, and this objective also provides a meaningful purpose for the current evaluation process. The performance of a test can be beneficially considered in terms of results achieved in the classroom. Problem areas can be identified and modifications subsequently developed which aim to improve future test results. The evaluation process consequently requires deliberation in a number of important areas. Why did students not perform better on certain performance criteria? To what extent was the test appropriate for the learners and the language program? Were the various aspects of the test valuable in achieving the program goals? Did the pre-teaching stage achieve its designated purposes? How representative were the learners of a typical class in the same course? The test evaluation process also typically considers a range of general areas relating to task performance: level of difficulty, task clarity, timing, layout, degree of authenticity, amount of information provided, and familiarity with the task format (Weir, 1993).

Description of the testing instrument

Teaching institutions selected to provide the AMEP generally develop testing instruments in order to meet both the national curriculum framework and specific institutional needs. The test selected for this study is an example of a communicative writing test administered at the upper-intermediate level in a contemporary teaching program. It was developed in-house at the Adelaide Institute of Technical and Further Education for assessing Competency 14 of the CSWE III. Competency 14 requires students to write a short formal letter of about 100 words in a one-hour time period. Students can use dictionaries and may draft and self-correct the letter as long as a completed version is submitted at the end of the time period. The testing instrument describes a situation in which the student has recently purchased a computer which appears to have a serious operational fault. Specific information is provided for the type of computer, the company where the computer was purchased, and the technical fault. Students are required to write a formal letter to the company explaining the situation and requesting appropriate action and assistance. A copy of the sample test is provided in Appendix 1.

Identifying problem areas in test performance

A class of eleven adult migrant students enrolled on the AMEP and working towards the level three certificate was selected for the study. In accordance with standard teaching practice, the students were first instructed in the performance criteria and completed some practice tasks. The test was then administered, and the students were assessed against the performance criteria and the benchmark performances. Each student's work was assessed in terms of whether each criterion was appropriately demonstrated, and students who demonstrated all the criteria were awarded the competency. After the assessment process was completed, a table was compiled which listed each student's results (success or fail) against the performance criteria. Totals and percentages were then calculated to identify the proportion of the class achieving each criterion, and a summary table was produced (see Table 1) to provide a quantitative basis for evaluating the test's performance.
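By way of illustration, the Python sketch below shows one way such a summary could be compiled from individual success / fail ratings; the criterion labels and student records are placeholders rather than the actual class data, which are reported in Table 1.

    # Placeholder records: each student maps criterion -> True (demonstrated) / False
    students = [
        {"layout": True, "staging": True, "vocabulary": False},
        {"layout": True, "staging": True, "vocabulary": True},
        {"layout": True, "staging": False, "vocabulary": False},
    ]

    # Tally the proportion of the class achieving each criterion
    for criterion in students[0]:
        achieved = sum(1 for student in students if student[criterion])
        percentage = round(100 * achieved / len(students))
        print(f"{criterion}: {percentage}% (n={achieved})")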

[ p. 127 ]

Table 1: Summary of test performance criteria vs. class results.

Performance Criteria (Adult Migrant Education Service, 1995) | Class Results (N=11)
follows conventions of layout for formal letter | 100% (n=11)
stages text appropriately - beginning, middle, and end | 100% (n=11)
writes paragraphs which clearly express objective information about situations / events | 73% (n=8)
provides information / supporting evidence to substantiate the claim | 27% (n=3)
makes a request for specific follow-up action | 100% (n=11)
uses appropriate conjunctive links e.g., causal, additive, temporal, conditional, as required | 73% (n=8)
uses appropriate vocabulary to reflect the topic | 36% (n=4)
uses appropriate politeness / level of formality | 91% (n=10)
uses grammatical structures appropriately | 64% (n=7)

The summary served to identify which performance criteria caused students some difficulty. Low percentages of students achieving a criterion were considered indicative of potential problem areas in the test performance. Three criteria were demonstrated by the entire class ("follows conventions of layout for formal letter", "stages text appropriately - beginning, middle, and end", and "makes a request for specific follow-up action"). Another criterion was demonstrated by most students ("uses appropriate politeness / level of formality"). Three criteria were demonstrated by many students ("writes paragraphs which clearly express objective information about situations / events", "uses appropriate conjunctive links e.g., causal, additive, temporal, conditional, as required", and "uses grammatical structures appropriately"). Finally, two criteria were demonstrated by just a few students ("uses appropriate vocabulary to reflect the topic", and "provides information / supporting evidence to substantiate the claim").

Investigating the problem areas

The students' tests were next reviewed in terms of the two performance criteria which caused major difficulty. It was found that, instead of providing supporting evidence to substantiate the repair claim, many students simply copied the description of the computer fault verbatim from the test instructions. They also appeared to lack the vocabulary resources needed to describe the situation in any detail. The subject area of the test was viewed as a prime cause of these two problem areas, since most students lacked the technical expertise to discuss the computer problem in sufficient depth to substantiate the repair claim.
A difficulty with the measurement process was also identified at this stage of the evaluation. While the curriculum framework provides for standardized assessments through a regulated training program, the requirement for simple ratings (success / fail) of the competency statements was found to be problematic. Accurate determinations of subtle differences between student performances appeared to require a very high level of expertise, and it was not clear whether this had been adequately established by the teacher training process. Furthermore, the significance attached to rating according to just two categories did not appear to fairly represent the varied range of performance standards demonstrated during the test. Marginally different performances could result in significantly different assessment results. However, this problem area is associated with criterion-referenced assessment in general, rather than with the current study.

[ p. 128 ]

Summary evaluation results

The test was practical and efficient in its initial administration. The test specifications, including the performance criteria, were relevant in determining a useful communicative writing task for the current group of learners. The test was, however, considered to be limited in a number of areas that were targeted for subsequent improvement. A major problem was evident in the presumption of subject knowledge and technical vocabulary associated with using a computer, and students were mostly unable to achieve two performance criteria on this account. The task was also made unclear by using an ambiguous technical term ("backfiles") to describe the computer fault. Furthermore, informal feedback received directly after the test administration indicated that some learners disliked using computers (or were at least partially technophobic), and reacted negatively to the task on this basis alone. Finally, the wording of the task was insufficiently clear about the requirement to discuss the situation in some detail in order to substantiate the repair claim. The validity and reliability of the testing instrument were negatively affected by these problem areas.

Revising the test

The next step in the evaluation cycle is to make revisions in order to address the problem areas. Firstly, the learning domain should be extended to include relevant computer terminology prior to the test administration. A general level of familiarity could, for example, be presumed if this test was sequenced after computer sessions (e.g., word processing classes, CD-ROM classes) had also been introduced into the curriculum.
A revised testing instrument was also developed (see Appendix 2). Since the students were recent immigrants to Australia, and the use of personal information is also a content area in the curriculum, it would be beneficial for students to use their own names and addresses, and today's date, rather than an anonymous third person's details. The test becomes a more realistic writing task (and a more authentic communicative activity) when personalised in this manner. Also, while students should discuss their feelings in order to substantiate the repair claim, leading statements ("you are disappointed ...") should not be used in the instructions. Rather, students should be required to describe their own emotional response to the situation. The description of the computer fault ("backfiles") should also be revised, since this term is unclear from a technical viewpoint. In addition, the task description should be reworded to state explicitly the requirement to discuss the situation in some detail in order to substantiate the repair claim. Since one criterion requires students to provide supporting evidence, this requirement should be made clear in the task description.
Finally, the testing procedure would be improved by collecting feedback from students, as discussed by Bachman and Palmer: "low-stakes tests can be improved by planning to use them over an extended period of time and collecting feedback on usefulness during each operational administration" (1996, p. 246). The value of this point was evident from the comments made by students directly after the initial test administration. A survey should be developed for this purpose in order to complement the current testing procedure.
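As a minimal sketch of how such feedback might be accumulated over repeated administrations (the survey items and the 1-5 rating scale below are assumptions for illustration, not part of the procedure described by Bachman and Palmer), consider the following:

    from collections import defaultdict
    from statistics import mean

    feedback = defaultdict(list)  # survey item -> all ratings collected so far

    def record_administration(responses):
        # Append one administration's 1-5 ratings to the running record.
        for item, ratings in responses.items():
            feedback[item].extend(ratings)

    # Invented responses from two hypothetical administrations
    record_administration({"task topic was familiar": [2, 3, 2],
                           "instructions were clear": [4, 5, 4]})
    record_administration({"task topic was familiar": [3, 2, 4],
                           "instructions were clear": [5, 4, 5]})

    for item, ratings in feedback.items():
        print(f"{item}: mean {mean(ratings):.1f} across {len(ratings)} responses")

A simple running summary of this kind would allow problem areas such as topic unfamiliarity to be monitored across successive test administrations.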

[ p. 129 ]

Conclusions

Communicative language tests can be evaluated in terms of their performance within specific learning contexts. The evaluation process involves analysing test results in light of both test specifications and program objectives. The test should subsequently be revised in order to address any problem areas, and the effectiveness of the modifications should then be evaluated as part of a continuing test evaluation cycle. In this paper, a sample language test has been evaluated and a range of modifications developed with a view to improving the test's performance within a communicative teaching context in Australia. A number of problem areas were identified during the evaluation process, and in each case consideration of the value and purpose of the relevant aspects of the test specifications and program objectives was beneficial in devising the modifications. It is recommended that language teachers implement similar test evaluation cycles in order to improve the performance of communicative testing instruments.

References

Adult Migrant Education Service. (1995). Certificate in spoken and written English III. Sydney: Author.

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.

Breen, M. (1989). The evaluation cycle for language learning tasks. In R. K. Johnson (Ed.), The second language curriculum (pp. 187-206). Cambridge: Cambridge University Press.

Cohen, A. D. (1994). Assessing language ability in the classroom. Boston: Heinle & Heinle.

McNamara, T. (1996). Measuring second language performance. Harlow: Addison Wesley Longman.

McNamara, T. (2000). Language testing. Oxford: Oxford University Press.

Weir, C. J. (1993). Understanding and developing language tests. London: Prentice Hall.




[ p. 130 ]
