Proceedings of the 1st Annual JALT Pan-SIG Conference.   May 11-12, 2002. Kyoto, Japan: Kyoto Institute of Technology.



Let us consider students 27 and 32 for a moment. Both stand out in Table 5 because of their unexpected response patterns: in both cases, their scores were lower than expected. Although the standardized residuals in Table 5 for these two students were not critical (only -3 and -4 for student 27, and -3 for student 32), we could delete these combinations (Rater Y, student 27, vocabulary; and Rater Y, student 32, discourse) from the statistical analysis. As shown previously in Table 3, Rater Y's fit statistic was somewhat lower than the others', which may have been caused by these unexpected response patterns.
Table 5 gives only limited information about misfitting students, and the remaining misfitting students cannot be explained from it. If we want to examine Student 21, whose misfit statistic was over 2.0, in more detail, we might look at that student's pattern of responses across the rating categories.
In Table 5, the item vocabulary appears twice in the unexpected response patterns. The key element here is probably the combination of Rater Y and vocabulary. Since there are only seven items, deleting the vocabulary item entirely is not a good solution. Instead, we might delete only the combination (Rater Y, vocabulary), a decision that does not depend on student 27 in particular. In this way we keep most of the necessary information.
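To make this concrete, here is a minimal sketch (in Python, not the FACETS procedure actually used in the study) of how flagged rater-student-item combinations could be screened out of a plain list of rating records before re-running the analysis; the record format and the sample scores are assumptions for illustration, while the flagged combinations follow Table 5.

    # Sketch: drop specific (rater, student, item) combinations flagged as
    # unexpected responses before re-running the analysis. The record layout
    # and sample scores are illustrative; the flagged combinations follow Table 5.

    ratings = [
        # (rater, student, item, score)
        ("Rater Y", 27, "vocabulary", 3),
        ("Rater Y", 32, "discourse", 3),
        ("Rater A", 27, "vocabulary", 4),
        # ... remaining records ...
    ]

    # Combinations identified in the unexpected-response report (Table 5).
    flagged = {("Rater Y", 27, "vocabulary"), ("Rater Y", 32, "discourse")}

    # Keep every record except the flagged combinations; all other ratings by
    # Rater Y, and all other ratings of these students, are retained.
    cleaned = [r for r in ratings if (r[0], r[1], r[2]) not in flagged]

    print(f"Removed {len(ratings) - len(cleaned)} of {len(ratings)} records")

In this way only the three suspect observations are set aside, and the rest of Rater Y's ratings still contribute to the estimation.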
Thus, Sub-Question 4 reveals that student ability differed widely. Whereas five students seemed to misfit the parameters of this particular sample, two overfit.

Table 5: Unexpected responses (3 residuals, sorted by order in data)

Category  Step  Exp.  Resid.  St. Res.  Rater    N Student  N Item
3         3     3.9   -.9     -3        Rater Y  27         4 vocabulary
3         3     3.9   -.9     -4        Rater Y  27         4 vocabulary
3         3     3.9   -.9     -3        Rater Y  32         2 discourse

Sub-Question 5: How widely did the item difficulty differ?

Table 6 indicates the difficulty of items in the Measure column. Grammar appeared to be the most difficult item, followed by fluency. Content appeared to be the easiest item.
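For reference, the measures reported in Table 6 come from a many-facet Rasch model of roughly the following form (a sketch after Linacre, 1989; the exact facet structure and sign conventions of the FACETS run in this study are assumed here for illustration):

    \log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k

where B_n is the ability of student n, D_i the difficulty of item i, C_j the severity of rater (or rater pair) j, and F_k the difficulty of step k of the rating scale. The Measure column reports each item's estimate in logits; note that Table 6 is oriented so that the easiest item (content) has the highest measure and the hardest (grammar) the lowest.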
Table 6 also shows that vocabulary is an item worth examining, both because its fit statistics fall outside the acceptable range (0.6-1.4) and because it is involved in the unexpected responses in Table 5. Since vocabulary is widely regarded as essential in composition assessment (cf. Kemp & Toperoff, 1998), it might not be a good idea to delete it from the item list simply because it was underfitting. Rather, we could try another way of dealing with this problem. By comparing the individual student's results with the unexpected responses in Table 5 and the vocabulary row in Table 6, we can decide whether to delete that combination of responses from the statistical analysis temporarily. Doing so would not drastically damage the statistical analysis as a whole.
The "overal score" item, which is overfitting (.4 -.4), was intended to give a general idea of students' performance. It tends to come to the center of the rating category. In other words, the scores will be consistent, and no new or specific information is expected from this item. Coming toward the center is the nature of this item, which is less informative and tends to have a misfitting value. However, we should not delete this item because it still gives us an overall view of students' ability. Thus, Sub-Question 5 indicates that seven items used in this rating scale did vary significantly in terms of difficulty.

[ p. 177 ]


Table 6: Item measurement report (arranged by N)
Obs. Score  Obs. Count  Obs. Avg.  Fair-M Avg.  Measure  Model S.E.  Infit MnSq  ZStd  Outfit MnSq  ZStd         N Item
172         64          2.7        2.78         -1.02    .23         1.1          0    1.1           0           1 grammar
195         64          3.0        3.09           .22    .24          .7         -1     .8          -1           2 discourse
202         64          3.2        3.19           .64    .25          --           0     .9           0           3 content
201         64          3.1        3.17           .58    .25         1.6           2    1.8           2           4 vocabulary
185         64          2.9        2.96          -.34    .23          .8          -1     .7          -1           5 fluency
194         64          3.0        3.07           .16    .24         1.1           0    1.1           0           6 organization
187         64          2.9        2.98          -.23    .24          .4          -4     .4          -4           7 overall
190.9       64          3.0        3.03             0    .24         1.0         -.6    1.0         -.5    .49   Mean (Count: 7)
  9.7        0           .2         .13           .54    .01          .4         2.2     .4         2.1    .10   S.D.

RMSE (Model): .24   Adj. S.D.: .48   Separation: 2.00   Reliability: .80
Fixed (all same) chi-square: 36.4, d.f.: 6, significance: .00
Random (normal) chi-square: 6.0, d.f.: 5, significance: .30
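As a check on the summary statistics above, the separation and reliability values follow from the reported RMSE and adjusted standard deviation by the standard Rasch formulas (the formulas are general, not specific to this analysis):

    \text{Separation} = \frac{\text{Adj. S.D.}}{\text{RMSE}} = \frac{.48}{.24} = 2.00, \qquad
    \text{Reliability} = \frac{\text{Adj. S.D.}^{2}}{\text{Adj. S.D.}^{2} + \text{RMSE}^{2}} = \frac{.48^{2}}{.48^{2} + .24^{2}} = .80

Together with the significant fixed chi-square (36.4, d.f. 6), this supports the conclusion that the item difficulties are reliably different rather than statistically equivalent.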

Sub-Question 6: How effective was this paired rating system?

Table 7 indicates that all six rater pairs had fit statistics within the acceptable range of 0.6-1.4, which means that the present paired rating functioned well. In other words, there were no unfair ratings, because the particular pairing of raters had no significant impact: students' scores did not depend on which pair of raters they were assigned to. The answer to this sub-question is that none of the four raters in this study needed to rate all of the students' compositions individually, since the paired rating system functioned well within the acceptable fit range.
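To illustrate the kind of systematic pairing involved, the sketch below (in Python) assigns each composition to one of the six possible pairs of four raters in rotation. This is a simplified illustration rather than the exact assignment used in the study (Table 7 shows that the actual pair workloads were not perfectly balanced); the rater labels A, D, M and Y simply follow the pair codes in Table 7.

    from itertools import combinations

    # Sketch: distribute 64 compositions across all six pairs of four raters
    # in rotation, so that every rater is linked to every other rater through
    # shared pairings. Rater labels follow the pair codes in Table 7.
    raters = ["A", "D", "M", "Y"]
    pairs = list(combinations(raters, 2))      # (A,D), (A,M), (A,Y), (D,M), (D,Y), (M,Y)

    num_students = 64
    assignments = {}                           # student number -> rater pair
    for student in range(1, num_students + 1):
        assignments[student] = pairs[(student - 1) % len(pairs)]

    # Each rater scores only the compositions assigned to pairs they belong to,
    # rather than all 64 of them.
    workload = {r: sum(r in pair for pair in assignments.values()) for r in raters}
    print(workload)                            # roughly half of the compositions each

Because every pair shares a rater with every other pair, the ratings remain linked, which is what allows the rater, student, and item measures to be placed on a common scale.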

Table 7: Pair measurement report (arranged by N)
Obs. Score  Obs. Count  Obs. Avg.  Fair-M Avg.  Measure  Model S.E.  Infit MnSq  ZStd  Outfit MnSq  ZStd   N Pair
197         70          2.8        2.97A        0        .23          .8         -1     .8          -1     12 AD
260         98          2.7        3.01A        0        .18         1.0          0    1.0           0     13 AM
186         56          3.3        3.07A        0        .27         1.2          0    1.1           0     14 AY
248         84          3.0        3.02A        0        .21         1.1          0    1.1           0     23 DM
222         70          3.2        3.07A        0        .26          .8          0     .8           0     24 DY
223         70          3.2        3.12A        0        .25          .9          0    1.1           0     34 MY
222.7       74.7        3.0        3.04         0        .23         1.0         -.2   1.0          -.1    Mean (Count: 6)
 25.9       13.2         .2         .05         0        .03          .1          .8    .1           .7    S.D.

RMSE (Model): .24   Adj. S.D.: .00   Separation: .00   Reliability: .00
Fixed (all same) chi-square: .0, d.f.: 5, significance: 1.00

Conclusions and implications

In summary, we can say that a rater's workload can be reduced by pairing raters in the systematic way advocated in this paper. The system of paired rating described here can be used to assess writing or speaking performance validly.

[ p. 177 ]


Given the results of this research, teachers can be more confident about assessing students' writing or speaking performance. Reducing teachers' workload in the evaluation process will enable them to give more authentic evaluations.

References


Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bode, R. K., & Wright, B. D. (1999). Rasch measurement in higher education. Higher Education: Handbook of Theory and Research, 14, 287-316. New York: Agathon Press.

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (Eds.). (1999). Dictionary of language testing. Cambridge: Cambridge University Press.

Douglas, D. (2000). Assessing languages for specific purposes. Cambridge, UK: Cambridge University Press.

Kemp, J. & Toperoff, D. (1998). Guidelines for portfolio assessment in teaching English. Retrieved May 99, 2002 from http://www.anglit.net/main/portfolio/default.html.

Linacre, J. M. (1989, 1993, 1994). Many-facet Rasch measurement. Chicago, IL: MESA Press.

Linacre, J. M. (1998). Facets Rasch software users guide. Chicago, IL: MESA Press.

Linacre, J. M. & Wright, B. D. (1998). Facets: Many-faceted Rasch analysis. Chicago, IL: MESA Press.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

McNamara, T. (1996). Measuring second language performance. London and New York: Longman.

Rasch, G. (1960, 1980). Probabilistic models for some intelligence and attainment tests. Copenhagen and Chicago: University of Chicago Press.

Rudner, L. M. (1992). Reducing errors due to the use of judges. Practical Assessment, Research & Evaluation, 3(3). Retrieved May 19, 2002 from http://pareonline.net/getvn.asp?v=3&n=3.

Wright, B. D. (1997, Winter). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45.

Wright, B. D. (1997). Fundamental measurement for psychology. In S. Embretson & S. Hershberger (Eds.), The new rules of measurement: What every psychologist and educator should know (pp. 65-104). Hillsdale, NJ: Lawrence Erlbaum Associates.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, IL: MESA Press.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, IL: MESA Press.





[ p. 178 ]
