Rasch & quality control: Controlling data, forgetting quality?

Second Language Acquisition - Theory and Pedagogy: Proceedings of the 6th Annual JALT Pan-SIG Conference.
May. 12 - 13, 2007. Sendai, Japan: Tohoku Bunka Gakuen University. (pp. 75 - 83)

Rasch & quality control: Controlling data, forgetting quality?

by Gerry Lassche (Miyagi Gakuin Women's University)

Abstract

This paper contains my reaction to several presentations on the use of the Rasch analysis at the 2007 Pan-SIG conference. My contention is that dependence on analyses, while technically sound, may have created a false sense of test validity. By referring to the works of Messick and Bachman, I first summarize this testing framework, and then apply it to the various presentations given at the conference. In general, I felt that practitioners were placing too much faith in a stats-driven effort, without giving enough prior consideration to theoretical assumptions. As a result, the practice led to what I believe were pedagogically questionable testing practice.

The Testing and Evaluation portion of the 2007 Pan-SIG Conference offered a platform for showcasing the use of Rasch analysis in several language teaching contexts. Rasch analysis provides a structure that data needs to fit in order for measurements to provide some measure of dependability; it is a mechanism for adjusting candidate scores across items and raters, so that said candidates can be treated more fairly (personal communication with Charles Adamson, 2007). Dr. Bond, in his plenary address, contended that Rasch analysis is essential for quality control for classroom testing. He stated further that teachers who did not reflect on their testing practices, through such techniques as the Rasch analysis, should not bother giving tests at all, or were being willfully negligent.

". . . while Rasch analysis may be useful for controlling data, it does not have anything to say about the quality of testing practice, or validity, and cannot be a stand-in proxy for such validation procedures."

The basic point I wish to emphasize is while Rasch analysis may be useful for controlling data, it does not have anything to say about the quality of testing practice, or validity, and cannot be a stand-in proxy for such validation procedures. On the other hand, what alarmed me during the conference was the confidence presenters seemed to place in the results that Rasch provided, when their testing practices seemed to be disconnected to currently accepted test development processes.

I will begin this conversation with a very brief summary of test development, and how Rasch "goodness of fit" should be distinguished from validity. I will then describe examples of testing practices that were presented as "dependable" (cases 1 and 3), and one that was not okay (case 2). Finally, I will discuss why I feel deeper questions regarding validity were critical in these contexts if better testing paradigms are to emerge.

[ p. 75 ]

Test development theory

A recurrent theme in Bachman and Palmer's (1996) seminal book on testing practice is that testing should be developed in three broad ranging steps: construct definition (what do you want to measure), construct interpretation (how are you going to measure it), and construct realization (did you actually measure it after all).

Construct definition

In this stage, the tacit beliefs that stakeholders have about a construct need to be described. This needs to proceed with respect to clarity, agreement among stakeholders, and usefulness in developing test tasks/items. For example, if my school was to develop an exit exam to evaluate student's learning achievement, we would first need to have developed a curriculum statement, which defined what language skills and abilities students would be expected to attain during their coursework. In other words, the intangible beliefs of what is worth teaching, of what should be taught, as reflected in each teacher's course design, would be clarified and aggregated into a single test which reflected this knowledge / performance domain. As curriculums are different from school to school, obviously entrance and exit tests would change also. As Messick (1988, p. 66) said, measuring what a student knows and what he can do as a result of teaching intervention is a critical endeavor that needs to be concerned with curricular relevance, such that test item characteristics correspond to the various program/curriculum objectives.

Construct interpretation

Next, in the construct interpretation stage, test designers need to operationalize the language construct that they intend to measure. The language which describes the construct should be evaluated: do all the stakeholders view the language in the same way? While 100% agreement is not possible in the real world, general consensus is a baseline for at least getting on with task of test design.

I wonder if general consensus is sometimes superficially possible, for example, with the use of band scores because of inherent ambiguity which characterizes their language content (for discussion, see Lassche 2006). The C1 band-level descriptor for spoken language from the Council of Europe (2001) reads:

Shows great flexibility in reformulating ideas in differing linguistic forms to convey finer shades of meaning precisely, to give emphasis, to differentiate and to eliminate ambiguity.

Ah, the devil is in the details! What does "great" mean? How many forms are required for "differing"? How precise is "precisely"? How does one define relative "emphasis", differentiation", and "ambiguity"? In order for this test to have construct validity, I believe the meaning in these various terms needs to be unpacked, so that stakeholders can indeed be in agreement about what they refer to.

What is possible, for example, is that a student could get a similar score from two different raters. Now, the raters may be using completely different interpretations of the criterion to arrive at their similar scores. Well and good for the student, and for the administration, who may only care about the score as a "number" anyway. As long as these different interpretations consistently correlated, Rasch analysis would not discriminate one rater as more or less severe than the other. This might lead to the conclusion that the raters are interpreting the construct in the same way, when in fact, they are not. For the rough and ready environments of most schools today, such an outcome is fine, because the student can be assigned a "number" that everyone "agrees" on. But, does education really benefit from such a case? In other words, does concurrent validity matter?

Construct realization

Finally, in the construct realization stage, do the characteristics of the test tasks and items reflect the construct definition in practice? This is usually an analysis after test trialing has occurred, and feedback from stakeholders has been elicited.

[ p. 76 ]

Rasch analysis

I wish to quote my (Lassche, 2006) summary of "reliability":

That the test obtains samples of consistent individual peformance while minimizing irrelevant variation is a measure of reliability (Hegelheimer and Chapelle, 2000). This is done through the use of reliable instrumentation, which is essential for claiming valid test use (Lynch, 1996, 44). Assuming that an interactionalist paradigm is essential for interpreting such variation, factors such as item / task characteristics (ie input such as a text to be manipulated in some way), the instructional rubric, and characteristics of the test setting need to be specified in order to ensure reliability.

When test-takers are assessed differently, ergo unfairly, because some raters are more severe than other raters, this is clearly connected to faulty instrumentation: in that case, scoring that is inconsistently applied; that is, more or less severe. In the case where items do not differentiate between test-takers in a way that adequately reflects differing degrees of performance, that is also a case of faulty reliability: the items do not refer back in a consistent way to the original construct.

I believe that this is the task which is ideally suited for Rasch analysis: by controlling for the severity of raters, students can receive more consistent scoring between test items on tests they take, and between students taking the same test. Rasch analysis is useful for checking reliability, but critically, does not say anything about what the items themselves measure, or how they are interpreted by raters.

Rationale

What tests should measure, as supported by an examination of the relative construct validity, is a critical issue which Messick (1988, 66) refers to as "ultimate relevancy." Messick (1988, 20) further sees construct validity as involving both an interpretation of the construct being measured, as well as the introduction of social implications, what Bachman and Palmer call impact in their testing parlance. In other words, validity does matter, and it is important that test designers and users agree on what they are doing.

". . . all elements which make up the three stages of test development, must be progressively examined in order to determine if test use is validated."

To summarize, one idea that arises from studying the works of people like Messick and Bachman is that the evaluation of language tests cannot proceed in isolation of the other components. That is, all elements which make up the three stages of test development, must be progressively examined in order to determine if test use is validated. Restricting analysis to the results of test-takers' performance on the test by itself cannot tell us anything about whether the test should have been used in the first place.

Questions that can guide test designers for evaluating the construct validity of their testing protocol are summarized in Lassche (2006):

Is the language achievement construct for this test clearly and unambiguously defined?
Is the language achievement construct for the test relevant to the purpose of the test?
To what extent does the test task reflect the construct definition?
To what extent do the scoring procedures reflect the construct definition?
Will the scores obtained from the test help us make the desired interpretations about the test takers' language achievement?

Where Dr. Bond sees Rasch analysis as essential for test quality control (such as question 3 through 5 above), I see it as step in a reliability check. Agreed, reliability is an important component for test development as Bachman and Palmer contend, but what is primary, and initial, is agreement about what we are measuring in the first place (questions 1 and 2). The first two stages of test development, concerning the test construct, are missed in the rush to judgment to declare testing practice as "dependable".

[ p. 77 ]

Caveat

It is quite possible that all the presenters mentioned here agree with the 3-step test design process I have just summarized. It is also possible that the tests described in the forthcoming test cases went through their respective test development stages. These steps were quite possibly glossed over in the presentations, precisely because validity was not the point! The presenters were interested in discussing the application of Rasch analysis to check the reliability of their tests, and so focused on that aspect as the content for their presentations. And they are certainly justified to adopt such a focus for the sake of time limits and simplicity. So is my reaction a non sequitur?

What I want to make clear is this: I have no problem with appropriate use of Rasch analysis – within the context of a larger validation process. The testing protocols that I will describe below, if they had gone through some kind of validation at all, to me seemed to be "off". The tests used had significant validity issues. To my eye, the presenters did not seem to be very concerned with that, and seemed to be happy discussing how Rasch analysis could be applied to their particular situations. I began to wonder if this reflected a persistent pattern: when enough data could be marshaled through the use of techniques, like a Rasch analysis, that showed high reliability scores, validity issues could be masked, or rendered simply irrelevant. The agreement I mention in the preceding paragraph, in cold light of day, could just be paying lip service enthralled to pragmatics: let's give scores to students that no one will argue with. Well, Messick, Bachman, and others simply cannot agree.

". . . when testers talk about reliability, the assumption should be that validation has taken place first."

What I see Messick and others as saying is this: when testers talk about reliability, the assumption should be that validation has taken place first. When that has, in fact, not taken place, the discussion erodes the trust that is placed in test designers and in subsequent discussions. Since tests have such an impact in education, that trust is crucial in upholding our practice.

Test cases

At the 2007 conference, I saw three presentations which referred to the use of Rasch analysis. These are summarized in Table 1.

Table 1 Three Rasch analyses from the 2007 JALT Pan-SIG Conference

Presenter	Test type	Test content	Rasch analysis
Takaaki Kumazawa	Course Achievement exit test	20 MC Vocabulary items 20 MC Reading Comp items	Dependable
Trevor Bond	Placement proficiency test	MC test (item #'s na)	Not dependable
Ed Schaeffer	Rater difficulty in thesis exit evaluation	Theses	Dependable

[ p. 78 ]

Presenter 1 - Course achievement analysis

In this case, Mr. Kumazawa used Rasch analysis to see if the items on his course achievement test adequately differentiated among three classes of students in terms of their mastery of course objectives. The presenter described student performance for the entire course as being evaluated for attendance (worth 40%), and a 40-item multiple choice (MC) test worth 60%, split between a vocabulary section and a reading comprehension section. The connection between test results and mastery of course objectives was not elucidated. No description of course objectives or sample items of test content were provided, so there was no way to know if a validation process had taken place.

Mr. Kumazawa showed that the Rasch analysis indicated that the items differentiated among the students in a fairly reliable way. The unanswered question is, do these particular items differentiate among students, in terms of course content mastery? In other words, when Mr. Kumazawa constructed his test, did he use items that reflected back on the curriculum objectives for his particular course?

Now, Mr. Kumazawa may have taught his class as test preparation course, in which examples of MC test items formed the bulk of class-time in a lecture-based format. If this was indeed the course content, then his test probably has curricular relevance: a MC test to assess content about MC items. There is still a question of impact: is it fair to assess students' performance in one shot? Would such a testing scenario elicit best performance, as good tests should do?

On the other hand, what about the basic question of validity: what should we be testing? This depends on a more prescient question: what should we be teaching? Mr. Kumazawa called his testing "criterion-referenced" (CRT) in the title of his presentation. In the testing field, CRT conventionally refers to some communicatively-oriented criterion descriptor, usually in the form of band descriptors (Douglas, 2000, p. 15ff). Critically, a criterion in the spirit of CRT is not merely a score of "A" or "B", as someone in the audience opined. The scoring is thus not a normative or arbitrary one: it is linked to a communicative construct. The C1 band descriptor for assessing speaking in section 2.2 above is an example of a typical criterion. It is hard for this observer to see what answering 40 MC vocabulary and reading comprehension items on a paper-and-pencil test represents about communicative ability, apart from measuring test-wiseness.

In fact, I have seen this practice done before, and writers have talked about it before: some stakeholders simply play a shell game with currently in-vogue terms, labeling a class or a seminar as communicative, or reflective, when in fact nothing in the content has changed from more traditional practices. I believe that this is what Mr. Kumazawa has done, and he used Rasch analysis to obtain data which justified his testing was fine, ergo course content was fine too.

In this case, I think that Mr. Kumazawa needs to stop doing reliability checks, go back to Stage 1 in his test development, and re-examine his personal beliefs about what he should be teaching, in addition to what his tests should be measuring. A more communicative approach in both content and assessment would be welcome.

[ p. 79 ]

Presenter 2 - Placement test

In his plenary, Dr. Bond referred to a placement test currently in use at the Islamic University in Malaysia. The test rates students according to five proficiency levels, each of which reflects achievement of different grammatical items, which were sequenced according to expected/percieved levels of difficulty. These elements are, again, measured with an MC-format. Thus, for example, "interrogatives" were part of level four, "passives" were part of level 3, and so on. A Rasch analysis indicated that the grammatical elements were incorrectly ascribed to their levels. Among numerous examples, some Level 4 items were easier to answer than items in Levels 1 or 2. Thus, placing students into classes based solely on the basis of this test result was highly unfair.

Following on from what I perceived as Dr. Bond's logic, what would seem to be the solution is reorganizing the items ascribed to each level, based on their real difficulty; that is, the difficulty as emergent from actual test performance. Thus, "interrogatives" would no longer be associated with Level 4 performance, it would be re-assigned into Level 1.

As with Presenter 1 above, the Rasch analysis is only able to evaluate the data it is provided. Essentially, that amounts to nothing more than a "garbage-in, garbage out" formulation. The analysis cannot go beyond the data, to speculate on content validity, and answer the question: should testing like this be done in the first place? Core issues about the intrinsic nature of an MC item corresponding with communicative ability, and tests purporting to measure proficiency, but containing only grammar knowledge items, are unanswered. In the case of proficiency, surely productive and creative skills need space in the testing as well. Again, nothing in the presentation indicated a validation process for this test. Did the results show any convergence with exit scores, or with other ability measures like the TOEIC^® (which itself is subject to validity concerns)? I can only speculate, based on the prima facie design of the test, that the administration only desired to assign a numerical score to entering students to ease admissions selection. As long as the students could be assigned a "number" that no one argued with, then the admissions could proceed apace.

My recommendation in this case goes far beyond simply re-organizing grammatical elements in terms of test difficulty, which would not amount to a sufficient proxy of language proficiency anyway. Instead, the department needs to decide what language performance they expect of students at various levels of language development and school progress: in year 1, what should students be expected to know and do? And then year 2? And so on. This kind of work, which involves curriculum development (or defining the nature of language achievement), is something sorely lacking in many departments I have come across. Professors, in my experience, are generally free to design courses according to their own whims, with departments holding actual student performance (ie what they can do with language, not just having an arbitrary course grade) to little if any accountability. Surely this crucial first step is the most important, and it is this lacuna that gives rise to the inconsistent, uninformed testing practices taking place today in many institutions?

One possible way to do this would be to link curricular objectives with the kind of language skills described by the Council of Europe. The ambiguity I referred to earlier is being addressed through the use of more detailed profile data banks, and it is an ongoing process. As a baseline for discussion in English departments, however, it would certainly provide a much-needed guideline. A testing process based on this set of guidelines is the Cambridge ESOL exams, which could serve as possible entrance/exit assessment measures.

[ p. 80 ]

Presenter 3 - Thesis evaluation

During the colloquium, Mr. Schaeffer presented the case of mitigating rater severity in thesis evaluation. He found that raters differed in their assessments of students' thesis content, and using Rasch analysis allowed the faculty to adjust the final score given to students, giving a more dependable result.

In this case, the construct for determining student performance is obvious from first glance: thesis completion. In questioning that followed Mr. Schaeffer's presentation, it emerged that no discussion among the faculty had ever taken place about what a good thesis actually looked like.

Thus, to one faculty member, good theses might show copious research with references aplenty. To another, the content must be organized in a certain way. To another, the use of numerical data, and the application of advanced statistics, takes precedence. But tellingly, Rasch analysis does NOT tell us what construct raters are using when marking severely or not, much less what they should be using. Ratings are merely observed and adjusted for, but not operationalized. Thus, students do not receive feedback on how they could improve, much less what they did wrong, and teachers do not have a firm basis on which to defend their scoring. It reminds me of a poem by John Godfrey Saxe (1873, p. 77):

And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right,
And all were in the wrong!

In this poem, the wise blind men have never seen an elephant, and do not know what one looks like. By using their own peculiar perspective (in this case, what was tactically proximate), they defined their own narrow view of "elephant-ness".

In my opinion, what is needed in for presenter #3 is more work done at Stage 2 – construct realization. There needs to be a consensus among faculty of what a "good thesis" looks like! How prominent should research be? How important is grammar content? Length of the paper? When all these tacit assumptions about thesis characteristics are spelled out, a fairer and more consistent evaluation of student work will emerge. Simply averaging the "snake-like" tail of one rater with the "fan-like" ears of another rater would not create a better-looking elephant for the poor folks in Indostan. And neither will averaging rater results in thesis evaluations help students know what was good, and not, with their own performance. In addition, some factors may be called into question as having curricular irrelevance: is spelling really that important? Are data-driven results really more important than logical, cogent argumentation?

I have included my own essay evaluation form (in table 2 below) to show that such criterion-based evaluation of long written work is possible. It is made up of three main components (structure, content, and style), with each containing sub-components. The 5-point Likert scale for each sub-component represents my estimation of the frequency of appearance (i.e. for logically connected, a score of "4" would indicate that 80% of the time, the logic was clear to me). For thesis scoring scenarios certain components and their indicators probably need to be adjusted and given different weightings, but at least the foundation for a discussion of construct realization is evident.

[ p. 81 ]

By using such a scoring mechanism, students can tell where there areas for improvement are needed. Raters can compare their scores and see where they differed in their estimations, and the adjustments can take place in a rationally discussed environment, rather than a statistically-driven one. My own reason for developing this was a little less noble: it meant less work for me! The sub-components were actually the kind of comments I tended to make on students' essays. By accumulating these comments into one checklist, I could save my own feedback time, be more consistent with it, and provide specific examples.

Table 2 A form for evaluating essays, according to structure, content and style
Table 2

Discussion

What alarmed me during these conference presentations was the faith placed in data-driven analyses using the Rasch analysis modeling structure, to judge whether testing practice was effective or not. While Rasch is certainly effective for estimating reliability, and adjusting for inconsistent ratings, I believe that it is not a replacement for initial test development at the construct definition and construct realization stages. If Rasch analysis is merely a post-hoc seal of approval on testing practices that hearken back to the grammar translation and audio-lingual era, then we are heading for trouble.

[ p. 82 ]

Dr. Bond mentioned at the end of his plenary that curriculum writers were, to use the parlance of this essay, the blind men of Indostan, ignoring the facts that empirical investigation can provide. While empirical investigation is important, and goes ever onward, it cannot escape its own intrinsic limits: it cannot measure what it cannot see. And thus, the men of Indostan are actually, ahem, also the test designers described in these pages. If the Rasch model is only fed data about a snake-like tail, it will tell you that this tail is long enough to be an elephant. It cannot tell if the snake-like tail is any better an impersonation of an elephant than the trunk-like legs. It cannot conceive that all these elements all fit together, in a gargantuan, awesome kind of way.

As humans, we are not limited by what we can see. Our suspicion about a big universe led us to develop telescopes. Our suspicion that apples fell from trees in predictable patterns led us to Newtonian physics. In other words, our beliefs of how the world worked was based on our intuition initially, our mental constructs, and through ingenuity we developed the techniques to measure these suspicions and thus promote them to the consensus of fact.

And so it is with language. We know that language is more than bits of grammar and lexicon that multiple choice formats tend to measure. However, our current instrumentation, with numerical indicators, tends to measure language in this way. These techniques are currently the most conceptually proximate to our understandings of how language works. An analysis of these data using such techniques cannot then prove that this is all there is to language; that would be a circularity. Our concepts of how language will evolve, and our testing practices will follow suit. Our imagination, and as educators our moral imperative, is thus to go beyond the current testing paradigms and keep looking for better practice, for both teaching and testing. And one day we will see more clearly.

NOTE: A response to some of the criticisms mentioned in this paper is online
at http://www.jalt.org/pansig/2008/HTML/SchaKum.htm.

References

Bowles, M. (2001). Problems in treatment of vocabulary in approved junior high school Textbooks: Informing Teachers. The ETJ Journal, 2 (2), 18-20.

McGroarty, M. & Taguchi, N. (2005). Evaluating the communicativeness of EFL textbooks for Japanese secondary schools. In C. Holten & J. Frodesen (Eds.) The power of context in language teaching and learning. (pp. 211-224) Boston: Heinle & Heinle.

Ministry of Education, Culture, Sports, Science, and Technology. (2001). Textbook Q and A. In Japanese. Retrieved October 2, 2007 from http://www.mext.go.jp/a_menu/shotou/kyoukasho/010301.htm

Ministry of Education, Culture, Sports, Science and Technology. (2007). Formal Education: Elementary & Secondary Education: 3. Textbooks. Retrieved May 25, 2007 from http://www.mext.go.jp/english/org/f_formal_16.htm

Sage, K. & Tanaka, N. (2006). So what are we listening for? A comparison of the English listening constructs in the Japanese National Centre Test and TOEFL^®iBT. In T. Newfields, I. Gledall, M. Kawate-Mierzejewska, Y. Ishida, M. Chapman, & P. Ross. (Eds.) Authentic Communication: Proceedings of the 5th Annual JALT Pan-SIG Conference. May 13-14, 2006. Shizuoka, Japan: Tokai University College of Marine Science. (p. 74 - 98). Retrieved Sept. 1, 2007 from http://www.jalt.org/pansig/2006/HTML/SageTanaka.htm

2007 Pan SIG-Proceedings: Topic Index

Complete Pan SIG-Proceedings: Topic Index

[ p. 83 ]