Response to "Rasch & quality control: Controlling data, forgetting quality?" by Edward Schaefer and Takaaki Kumazawa
Abstract
[ p. 47 ]
"Rasch measurement . . . basically test[s] the notion of unidimensionality, the idea that a test should be testing one thing at a time," |
[ p. 48 ]
"To make valid decisions, teachers should evaluate both students' product and process, and set multiple criteria." |
[ p. 49 ]
". . . in spite of the absence of a clearly defined construct for writing ability, FACETS was able to reliably calibrate writer ability and rater severity, but that the raters did in fact interpret the scale in different ways." |
[ p. 50 ]
Here we are entirely in agreement, and his points are well taken, as shown in the discussion on validity at the beginning of this article. Schaefer in fact used a similar scoring rubric in a previous FACETS study of native English speaker ratings of Japanese EFL essays (Schaefer, 2004). In the present study, however, that was simply the preexisting reality of the situation, whether or not Rasch analysis was used. One reason Schaefer undertook the study was to persuade the faculty of the need to elucidate an explicit construct. It was therefore surprising that the data turned out to be as sound as they were, with the only case of significant bias being the researcher himself. One possible explanation is that most of the faculty members have worked together for many years and share an implicit construct: their views on what constitutes good writing were thus not as divergent as Lassche worries they might be. This possibility does not, of course, obviate the need for a clearly stated and theoretically justified construct for good thesis writing.
[ p. 51 ]
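For readers unfamiliar with how FACETS can calibrate writer ability and rater severity as separate quantities, the standard many-facet Rasch model (Linacre's formulation, which FACETS implements) may help; the notation below is a generic sketch, not taken from the article under discussion:

```latex
\log \frac{P_{njik}}{P_{nji(k-1)}} = B_n - C_j - D_i - F_k
```

Here \(P_{njik}\) is the probability that writer \(n\), rated by rater \(j\) on criterion \(i\), receives category \(k\) rather than \(k-1\); \(B_n\) is writer ability, \(C_j\) rater severity, \(D_i\) criterion difficulty, and \(F_k\) the step difficulty of the rating scale. Because each facet enters the model additively, rater severity can be estimated and reported separately from writer ability, and bias analysis then examines residual interactions (e.g., a particular rater-writer pairing) against the model's expectations, which is how the single case of significant bias mentioned above would be flagged.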