An approach to the validation of judgments in language testing

Proceedings of the 2nd Annual JALT Pan-SIG Conference. May 10-11, 2003. Kyoto, Japan: Kyoto Institute of Technology.

by GholamReza HajiPourNezhad Jordan University

Abstract

In every testing effort, we make decisions and choices under the influence of subjective judgments. Although great logical, psychometric, and empirical efforts have been made in the profession to avoid basing test construction on subjective views, it seems that subjectivity is an inevitable aspect of language test planning, construction, and validation. This paper reviews three main solutions or attempts to reduce subjectivity in language testing and argues in favor of a pooled expert judgment approach to test construction.

Keywords: test construction, testing validation, test ethics, content validation, judgment pooling, subjectivity

Subjective judgments constitute a significant portion of any language testing endeavor. As Pilliner (1968) notes, there is no means to avoid subjective judgments even in tests which appear to be objectively designed, scored, and interpreted. Subjective judgments are so pervasive that they exist in all testing-related areas (Alderson, 1993). Because of this ubiquity, two equally knowledgeable testing authorities with different theoretical and/or practical backgrounds may come to diametrically opposed decisions at each stage of testing. Moreover, each expert will probably be able to justify her/his decisions with self-corroborating "logical", "theoretical", or "empirical" evidence. Interestingly, each party may be bewildered when the other fails to discern certain "straightforward facts".
If we come to accept the view that each theoretical position results from an amalgam of judgments and choices, we might begin to assume that the views held by each testing professional are relativistic and based on a torrent of previous judgments. Though this relativistic perspective has a long tradition, it might seem untenable from empirically-oriented scientific perspectives. However, in recent decades the basic nature of scientific objectivity and validity has been a topic of considerable inquiry (Knorr-Cetina, 1981). Under most circumstances, testing specialists in the same camp share many common beliefs as well as some differences of opinion. In fact, no two experts think in identical ways about everything. This is partly due to the fact that there are no absolute criteria on which to base decisions. In every decision-making process, we normally start by gathering evidence. However, in the humanities it is rare to find sufficient evidence to "prove" ideas. As a result, after collecting bits of evidence, we still have to make decisions regarding which data to accept and which to reject.
To see how deeply judgments are embedded in all phases of a testing effort, it may be useful to recollect the stages we went through when last constructing a test. The very first step might have been to start reading (or reviewing) a relevant book or article. This selection process is judgment-based and the material read may influence test outcomes. Next we perhaps started considering how to specify the candidates' needs and formulate test contents. We might have talked to some colleagues to gather information, or simply continued reading about various aspects of each task. Each of these steps is judgment-based and impacts test outcomes. Throughout the process of test development, we either start from a theoretical background and later try to realize it in the form of test content, or start from test samples and try to find out if there are "appropriate" theoretical grounds for them. All of these processes are mainly judgment-based. Whether we take an inductive or deductive approach, we are making a variety of judgments. As a matter of fact, in sampling the kind of language, test content, subjects for the study, test tasks, test methods, statistical procedures to determine reliability and validity, and the like, we are bound to make subjective decisions for which, in the final analysis, there is no "definite evidence".

Drawbacks

Although we have to form judgments throughout the entire language testing process, the judgments made by an individual testing expert are never fully tenable. Some of the drawbacks can be enumerated here:

Experts lack unanimity about many key issues. In other words, different experts make different judgments regarding issues such as the difficulty level of test items, what a test item measures, and so on (Alderson, 1993). With no unanimity, and in certain cases not even general agreement, how do we know whether one judgment is any better than the others?

Judgments made by each testing expert are the outcomes of her/his individual theoretical and/or practical backgrounds (Brown, 1996, p. 235). It would be very strange to see a structuralist advise someone to employ Oller's (1979) pragmatic expectancy grammar through dictation, or an advocate of communicative testing suggest using grammar-based discrete-point test items. This clearly suggests that judgments are relative to one's stance.

The form and intensity of judgments vary according to the expert's knowledge. The results of an earlier study (Hajipournezhad, 2002) indicate that the more theoretical knowledge a group of experts has in a certain area of testing, the less extreme their judgments are and the less inter-expert variance there is (approaching a platykurtic bimodal distribution). Therefore, judgment types are directly influenced by knowledge levels.

Under most circumstances, test developers who make judgments do not quantify them. They make them once and merely employ them to take some particular action or to underpin a further judgment. It is seldom possible to retrace each step in the judgment-making process and to quantify that step in terms of percentages or a scale (Brown, 1996, p. 235). This makes the judgments of individual testers almost untraceable.

If it is not possible to rate the degree of the judgment(s) which underlie a particular thought or action, another shortcoming of judgments appears. The magnitude of what we may name "translation validity" (the degree to which desired needs are translated into test tasks) depends on the quantification of decisions and on mapping them onto subsequent judgments and actions. Therefore, translation validity (which is in some ways similar to content relevance) will seldom be dependable.

Because there is usually only one person involved in making judgments, cross-analysis is insufficient. Judgments are made, and clear information about the steps taken to reach those judgments is usually lost in the course of time. Therefore, the likelihood of error is increased.

Current Solutions

Given these shortcomings, several attempts have been made to compensate for the subjectivity inherent in test constructors' judgments. In most testing undertakings, it is frequently emphasized that subjectivity has to be avoided. Three widely used attempts to overcome subjectivity are discussed here.

1) Validation, in general

Validation is, as frequently stated in testing textbooks, the most important consideration in test development. A recent enumeration on the LTEST-L mailing list turned up dozens of validity terms. In fact, there are more than thirty terms referring to different aspects of validity under construct validation alone. The sole purpose of all these aspects of validity is to find out whether a test measures what it purports to measure. All too often, measurements are contaminated by other attributes. Therefore, if the results of a validation study show that a test is uncontaminated for its particular use or interpretation, the test is considered valid. In recent years, hundreds of validation studies have been conducted, with numerous useful findings. However, one problem is that validity measures focus only on the 'validity of measures as products'. There has been almost no attempt to validate the decisions and judgments made in the process of test development and validation. This means that most studies ignore the process of how a test is constructed, and take no notice of the judgments underlying the stages of its construction process. The validation process may not reveal the internal structure of a test or a test item, which is a direct outcome of the judgments underlying its construction. Given that the decisions and judgments underlying test construction play a tremendous role in shaping the minute details of tests, shouldn't validity measures assess these effects? If not, it is worth exploring why.

Furthermore, most validity measures (particularly those in the realm of construct validation) are statistically based, and it goes without saying that statistics without logic leads nowhere. However, where does logic come from? Generally, logic is expected to reside in the way an individual's test is constructed and later validated. Therefore, in language testing, logic is all too often a euphemism for a series of well-contemplated subjective judgments which are usually individually based. The purpose of this discussion is not to undervalue the significance of the validation studies conducted so far, but to touch on an area of validation which has been mostly ignored. In Messick's (1980, 1988) progressive matrix of validation, we have to be careful about evidential and consequential justifications for the interpretation and use of a test. It is practically impossible for one language test constructor to come up with a near-perfect picture of the value implications and the social consequences of a test. However, the usual practice is to develop tests on a variety of subjective judgments in precisely this way, and only later to try to discover and possibly rectify any problems. Most approaches to validity have up to now been a posteriori, whereas we need to consider an on-line validity measure, one which focuses on the judgments acting as cornerstones for the development of a test while the test is being constructed.

2) Face validity, in particular

Face validity, which is often regarded as surface, false, or pseudo validity, has now been almost eliminated from validity measures (Bachman, 1990, p. 287). It originally aimed to incorporate the judgments of laymen or other experts into the final decisions on the "apparent validity" of a test. There has been considerable antipathy toward this term among many scholars. For instance, Mosier (1947) states:
The concept is the more dangerous because it is glib and comforting to those whose lack of time, resources, or competence prevent them from demonstrating validity (or invalidity) by any other method. . . .This notion is also gratifying to the ego of the unwary test constructor. (p. 194)

Others have focused on its technical shortcomings. Anastasi (1988) states:
Content validity should not be confused with face validity. The latter is not validity in the technical sense; it refers, not to what the test actually measures, but to what it appears superficially to measure. Face validity pertains to whether the test "looks valid" to the examinees, who take it, the administrative personnel, who decide on its use, and other technically untrained observers. (p. 144)
Most researchers have defined face validity as pertaining to a superficial examination of a test by non-experts. For instance, Cronbach (1971) believes that face validity is vague and subjective; Ingram (1977, p. 18) defines it as 'surface credibility or public accountability'; and Alderson et al (1995, p. 172) consider it holistic rather than analytic. Cronbach (1984, p. 182) also indicates that "a test that seems relevant to the lay person is said to have 'face validity'". Above all, Fink (1995) dismisses face validity because it does not depend on established theories for support.
However, Roberts (2000) emphasizes that we do not have to follow this kind of conceptualization of face validity. He argues that we can use face validity with expert informants and not as a superficial examination. This, he believes, will turn face validity into a more effective measure. Furthermore, there are those scholars who support the application of face validity as it caters to response validity by enhancing applicants' acceptance of the testing procedure (Alderson et al, 1995, p. 173; Davies et al, 1999, p. 59). In fact, face validity has gained a new status in the framework of communicative language testing which favors this type of validity due to its 'real-life' definition of language proficiency (Carroll, 1985).
Now, face validity, although it has a positive effect on response validity and public relations, is not without real problems. First and foremost, like all the measures of validity, face validity is product-oriented. That is, the informants, whether experts or laymen, consider the test only after it has been constructed. They have no chance to become familiar with the assumptions underlying the specific test construction. Second, face validity presumes that informants provide "valid reports", yet those who have been consulted often have a tendency to report the test as valid.
Third, face validity efforts yield qualitative reports of whether the test represents what the given constructs demand. In other words, quantitative face validity, which uses rating scales and planned interviews, is rarely practiced. Fourth, face validity, as normally practiced, involves superficial examination of test content. This leaves no room for in-depth analysis of test content compared with test constructs (translation validity). Fifth, face validity is a kind of judgment call which does not follow the principles of judgment surveys in general. Test developers usually ask only one or two observers, or very few, to face validate the test. Moreover, in most face validity studies, no appropriate statistical procedure is applied to the data to make the subsequent decisions statistically meaningful and dependable. Sixth, and still very important, is the tendency of many test constructors to use face validity as sufficient grounds for validity, obviating the need for other measures of validity. Due to these problems, I would like to emphasize that face validity has not served language testing well, either theoretically or practically.

3) Moderation

Davies et al (1999) define moderation as
A process of review, discussion and evaluation of test materials or performance by a group or committee of language testers, raters, teachers and/or other experts. (p. 122)
The moderation process can focus on test content/content validity or on judgments made about candidates and their performance. Alderson et al (1995, pp. 63-64) focus on how test items should be moderated informally at first and later by a formal committee. They also mention some of the pitfalls of moderation committees. Kindler (1996) describes moderation as a 'quality assurance' model. Generally speaking, whatever the type of moderation, its main objective is for testing experts to reach agreement on certain issues.

Undoubtedly, moderation has been a great help to the decision-making process in language teaching and testing situations. It has indisputably been an effective means of making instructional/assessment decisions. Nevertheless, moderation has its own drawbacks. I believe there are two issues which limit this method in overcoming subjectivity.
First, what happens if a member of the moderation committee who holds a particular stance regarding language testing theory and practice is faced with an opposing view? Naturally, s/he would try to convince others. Two related problems arise here. To begin with, s/he may lack a comprehensive knowledge of opposing views, so her/his ability to discuss the matter in depth might be limited. How is it possible to make a fair choice or judgment without a good knowledge of an opposing viewpoint? For example, can one justly claim that communicative testing is more productive than integrative testing if one lacks sufficient knowledge of the advantages and disadvantages of integrative testing? As a result, the moderation process can degenerate into a polemical debate rather than a balanced, fruitful negotiation.
Second, moderation is, like face validity and other validity measures, an a posteriori approach to test design. It mostly focuses on test content and test tasks. It is, in fact, a more refined form of face validity in this respect.
In the light of these shortcomings, I argue that although all of the methods of overcoming subjectivity have their advantages in terms of test development and design, none has managed to control for the kind of judgments which are fed into the language test construction, validation, and interpretation process.
At this stage, I will present a model for investigating judgments in language testing and describe how a method to reduce subjectivity in testing can be applied. This expert judgment pooling approach has been used by researchers such as Brown, Ramos, Cooke, and Lockard (1990) and was designed and utilized by Hajipournezhad (2001) to investigate the judgment decisions made by testing experts in Iran and also to consider whether those judgments could be used for the construction of a standard proficiency test in Iran. The actual test, which was developed through this process, will be described in a subsequent paper. In this paper, the model used for the development of that test will be outlined.

Suggested Model

The suggested model for the validation of judgments is based on the following principles:

Judgments are unavoidable in constructing and validating language tests.

No individual testing expert can make 'perfect' judgments. Group judgments (with certain limitations) are the key to near-perfect judgments and decisions.

Pooled judgments by testing experts should be expressed in quantifiable terms to enhance precision.

Pooled judgments about each issue should include opinions about alternative options and "consensus" should be ascertained on the basis of clear statistical measures.

An a posteriori validation of a test (after it is constructed) cannot control for all the major (subjective) judgments made in the process of test construction. Many judgments influence how a test is constructed, and how it influences candidates' performance, without being captured by retrospective test validation.

Informed pooled judgments should be utilized in all aspects of testing, not merely in the moderation of test tasks or candidates' ability level.

Since test tasks should reflect target situation needs, pooled judgments should initially focus on the identification of those needs at both theoretical and practical levels. Expert opinions should be utilized from the outset in the test development process. This involves the identification of expert assumptions about each particular testing situation. If these assumptions are not investigated, they will influence the test construction process without being noticed.

Round-table pooled judgments should be replaced by face-to-face individual judgment interviews to avoid the first shortcoming mentioned above for moderation.

A list of language testing assertions (with the assertions which are considered correct, incorrect, and under dispute clearly listed) should be constructed and utilized for the evaluation of test constructors' judgments and knowledge. These assertions can be gathered from the testing experts acting as informants of the study and/or from the researcher's own assertions.
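
The principles above call for pooled judgments to be expressed in quantifiable terms and for consensus to be ascertained statistically. A minimal sketch of how this might look is given below in Python. It is illustrative only and is not the procedure used in the study described here: the 5-point rating scale, the sample ratings, the function name pool_judgments, and the 0.7 cut-off are all assumptions, and the agreement index used (the r_wg index of James, Demaree, and Wolf, 1984) is only one of several measures that could serve this purpose.

```python
from statistics import mean, median, variance


def pool_judgments(ratings, scale_points=5, threshold=0.7):
    """Summarize the ratings several experts gave to ONE testing assertion.

    ratings      -- one integer per expert on a 1..scale_points scale (assumed)
    scale_points -- number of points on the rating scale (assumed: 5)
    threshold    -- agreement level treated as "consensus" (assumed: 0.7)
    """
    if len(ratings) < 2:
        raise ValueError("need at least two expert ratings")
    observed_var = variance(ratings)            # sample variance across experts
    uniform_var = (scale_points ** 2 - 1) / 12  # variance if ratings were random
    # r_wg agreement index: 1 means identical ratings; floored at 0.
    agreement = max(0.0, 1 - observed_var / uniform_var)
    return {
        "mean": mean(ratings),
        "median": median(ratings),
        "agreement": agreement,
        "consensus": agreement >= threshold,
    }


# Hypothetical data: five experts rate two assertions on a 5-point scale.
pooled = {
    "assertion_1": pool_judgments([4, 5, 4, 4, 5]),
    "assertion_2": pool_judgments([1, 5, 2, 4, 3]),
}

for name, summary in pooled.items():
    print(name, summary)
```

With these made-up ratings, the first assertion would be flagged as having reached consensus and the second as remaining under dispute, which mirrors the idea of listing assertions as accepted or disputed on the basis of an explicit statistical criterion rather than round-table discussion.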

Eighty experts were recently involved in developing a test based on these principles in Iran. Detailed information about that test and how it differs from other tests will be forthcoming. The main purpose of this article is to highlight the advantage of a pooled expert judgment approach to test construction. The author maintains that such an approach overcomes most of the limitations inherent in a posteriori test validation procedures. By utilizing the pooled judgments of a large body of language experts, better information about how to plan, construct, validate, and interpret tests can be obtained.

References

Alavi, S.M., & Hajipournezhad, G. (2001). Validation of the T-test. Tehran ELT Journal, 33 (4), 24-35.

Alderson, J.C. (1993). Judgments in language testing. In D. Douglas & C. Chapelle (Eds.), A new decade of language testing. Alexandria, VA, USA: TESOL.

Alderson, J.C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge University Press.

Anastasi, A. (1988). Psychological testing. London: Macmillan.

Bachman, L.F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Brown, Ramos, Cooke, & Lockard (1990). Study cited on p. 220 of J.D. Brown & T. Hudson (2002), Criterion-referenced language testing. Cambridge: Cambridge University Press.

Brown, J.D. (1996). Testing in language programs. Prentice Hall Regents, Prentice-Hall, Inc.

Carroll, B.J. (1985). Second language performance testing for university and professional contexts. In P.C. Hauptman, R. LeBlanc & M.B. Wesche (Eds.), Second language performance testing. Ottawa: University of Ottawa Press.

Cronbach, L.J. (1971). Validity. In R.L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-597). Washington, D.C.: American Council on Education.

Cronbach, L.J. (1984). Essentials of psychological testing (4th ed.). New York: Harper and Row.

Davies, A., Brown, A., Elder, C., et al. (1999). Dictionary of language testing. Cambridge University Press.

Fink, A. (1995). The survey handbook. The survey kit, Vol. 1. Thousand Oaks, California: Sage Publications Inc.

Hajipournezhad, G. (2002). Which one speaks louder in language testing, actions or words? In M. Swanson, D. McMurray & K. Lane (Eds.), Proceedings of the Nov. 23-25, 2001 JALT Conference in Kita Kyushu, Japan (pp. 805-811). Tokyo: JALT.

Kindler, J. (1996). Moderation: What it is and why we have it. Discussion paper. Melbourne, Australia: Adult Basic Education Resource and Information Service.

Knorr-Cetina, K.D. (1981). The manufacture of knowledge. Oxford, UK: Polity Press.

Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences of measurement. In H. Wainer & H. Braun (Eds.), Test validity (pp. 33-45). Hillsdale, NJ: Erlbaum.

Mosier, C.I. (1947). A critical examination of the concepts of face validity. Educational & Psychological Measurement, 7, 191-205.

Oller, J.W. (1979). Language tests at school: A pragmatic approach. London: Longman.

Pilliner, A.E.G. (1968). Subjective and objective testing. In A. Davies (Ed.), Language testing symposium: A psycholinguistic approach (pp. 19-35). Oxford University Press.

Roberts, D.M. (2000). Face validity: Is there a place for this in measurement? Shiken: JALT Testing & Evaluation SIG Newsletter, 4 (2), 5. Retrieved from the World Wide Web at jalt.org/test/Roberts_1.htm on July 20, 2003.
