Investigating student response patterns on a multiple-choice grammar and reading proficiency test

Lifelong Learning: Proceedings of the 4th Annual JALT Pan-SIG Conference.
May 14-15, 2005. Tokyo, Japan: Tokyo Keizai University.

Investigating student patterns on a multiple-choice
grammar and reading proficiency test

by Yuji Nakamura (Keio University)

Abstract

This paper presents an exploratory study which shows how test development can be informed by analyzing the different ways that different groups of students respond to a given item on a given test. The Rasch Measurement model was used to analyze the performance of three groups of students (Business, Economics, and Law majors) on a multiple-choice test of reading and grammar proficiency. This study compared the different ways in which each group of students responded to each item on the test. The results of this analysis can be used not only to improve the test but also to inform the teaching process. However, it should be emphasized that this is a pilot study that needs to be replicated with a larger number of students before the results can be generalized to other situations.

Keywords: differential item functioning, response pattern investigation, individual test takers, Rasch measurement

I. Purpose

Differential item functioning (DIF) is a feature of a test item that shows up in a statistical analysis as a group difference in the probability of answering the item correctly. The presence of differentially functioning items in a test has the effect of boosting or diminishing the total test score of one or other of the groups concerned. As Davies et al. (1999, p. 45) note, items exhibiting DIF "may be regarded as biased only if group differences can be traced to factors which are irrelevant to the test construct."
DIF analysis, which explores the relative difficulty of a test item in relation to some characteristic of the group to which it has been administered, has attracted a great deal of attention from test developers because it identifies items that pose a considerable threat to the validity of scores on general English tests (cf. Milanovic, 1998, p. 142).

In particular, it is crucial to detect DIF items in language proficiency tests in which test takers with diverse backgrounds are involved. The examination of each individual student's response pattern also supplies important information about their performance which teachers can use to give feedback to their students.

In classical item facility comparisons, differences in performance between groups of students are examined to identify whether or not one group of students was disadvantaged relative to another on the test.
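A classical item facility comparison of this kind can be sketched in a few lines of Python. This is only an illustration: the function name and the tiny response matrix are hypothetical and are not the data analyzed in this paper.

```python
def item_facility_by_group(responses, groups):
    """Return {group: [proportion correct per item]} for 0/1 scored responses."""
    by_group = {}
    for row, group in zip(responses, groups):
        by_group.setdefault(group, []).append(row)
    return {
        group: [sum(column) / len(rows) for column in zip(*rows)]
        for group, rows in by_group.items()
    }


# Hypothetical data: 4 students (rows) answering 3 items (columns).
responses = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
groups = ["Economics", "Economics", "Law", "Law"]
facility = item_facility_by_group(responses, groups)
```

A large gap between two groups' facility values on one item is only a starting point, since the gap may simply reflect a real difference in overall ability between the groups.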
This paper examines a multiple-choice test of grammar and reading proficiency to provide an example of how items that function differently for one or more of the three student groups (Business, Economics, and Law majors) can be identified through the Rasch model. In other words, through the DIF procedure, the paper investigates how students grouped by their major subject perform differently on individual items.

II. Research Design and Method

Three groups from a university in Tokyo were used in this study. One group consisted of 33 (20 male and 13 female) economics majors. Another consisted of 21 (14 male and 7 female) business administration majors. A third group consisted of 8 (6 male and 2 female) law majors. The respondents were between 18 and 22 years old.
The test consisted of twenty grammar items and ten cloze items, listed in Appendix A. The grammar items were designed to measure students' grammatical knowledge, and the cloze items were intended to measure grammatical knowledge, vocabulary knowledge, and general reading ability. The whole test was developed by a group of English teachers working at the above institution.
The students had 40 minutes to complete this test. The lack of any missing data (items unanswered) indicates that there was enough time for them to consider every item in the test. Although there were no specific incentives for them in terms of their grades, they seemed to take the test seriously as a measure of their English proficiency, so response validity is assumed.

III. Procedure of the Analysis

Test analysis can be conducted in various ways, but in this paper the basic psychometric properties of the test in Appendix B were analyzed first. This means the test's validity and reliability were examined to assess the extent to which it measured the traits it was designed to test. The validity of this test was assessed through chi square measurements, logit residual of fit statistics, and the location order of items from easy to difficult. The reliability of the test was analyzed using the Person-Item Location Distribution index (otherwise referred to as the separation index). Let us now discuss these issues in detail.

1. Validity Issues

First, we will look at some of the basic psychometric properties of the test, that is, its validity (a question of whether it fits the Rasch model) and its reliability (as indicated by the separation index).
Second, we will look at the question of whether the three student groups perform differently on each of the items (differential item functioning, DIF). This is important because if groups are performing very differently at the item level, then those items can be suspected of bias, i.e. favouring one or more groups above others due to an item characteristic outside of the construct it purports to measure. This is undesirable if one wants to select students on the basis of test results. However, it can be used to diagnose the strengths or weaknesses of each group to facilitate tailoring curriculum and teaching to specific groups.
Please note that the samples here are too small to make firm judgments about the test items. The data is used only illustratively to show how the model can provide information about student groups and about the items that could help in offering students feedback.

The validity of this test was assessed through the following three procedures:

The Chi square values were examined to look for sudden increases.
The logit residual tests of fit were investigated. If the value was below -3.00 or above 3.00 (in the present case, a stricter range of -2.00 to 2.00 was chosen), further analysis was conducted, especially if the Chi square value was also large and significant.
The location order, which shows the items in order from easiest to most difficult, was examined. This indicates which items the students found easy and which they had trouble with.
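The second of these checks can be illustrated with a short Python sketch. The fit residuals and chi-square probabilities below are invented for illustration, and the 0.05 significance level and all names are assumptions rather than settings reported in this study.

```python
def flag_misfitting_items(fit_residuals, chi_square_p, limit=2.0, alpha=0.05):
    """Return (item, residual, chi_square_significant) for items outside +/- limit."""
    flagged = []
    for item, residual in fit_residuals.items():
        if abs(residual) > limit:
            # alpha = 0.05 is an assumed significance level.
            significant = chi_square_p.get(item, 1.0) < alpha
            flagged.append((item, residual, significant))
    return flagged


# Hypothetical fit statistics for three items.
fit_residuals = {1: 0.4, 2: -2.3, 3: 2.6}
chi_square_p = {1: 0.61, 2: 0.20, 3: 0.01}
flagged = flag_misfitting_items(fit_residuals, chi_square_p)
```

Items flagged on both criteria would be the first candidates for closer inspection.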

2. Reliability Issues

The reliability of this test was assessed through the following three procedures:

By examining the separation index and the Cronbach Alpha reliability estimate.
By examining the Item Characteristic Curves (ICC).
By examining the DIF.

IV. Results and Discussion

1. Validity

Tables 1 and 2 in Appendix B show that the item fit of this exam looked satisfactory, but this may partly be due to the small number of students in the sample. A larger sample may show more misfitting items. As mentioned earlier, a larger sample size is needed to make this analysis more certain. The minimum recommended sample size is 100.
Next, let us look at the location order of the items. Table 3, also in Appendix B, indicates that items 11, 30 and 27 were the most difficult, while items 20, 3 and 10 were the easiest.

2. Reliability

The separation index for this exam was 0.332 and the Cronbach Alpha reliability estimate was 0.354. This separation index was very low. Thus, the test did not discriminate among students very well. The distribution graph in Figure 1 shows this, probably because the test was too easy for most students in all three majors.
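For readers who wish to check a reliability estimate of this kind on their own data, Cronbach's alpha for dichotomously scored items can be computed as in the following sketch. The response matrix is hypothetical, and population variances are used throughout, which is one common convention and an assumption here.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a students-by-items matrix of 0/1 scores."""
    k = len(responses[0])  # number of items
    item_variances = [pvariance(column) for column in zip(*responses)]
    total_variance = pvariance([sum(row) for row in responses])
    # Undefined when every student has the same total score (variance of zero).
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 4 students (rows) answering 3 items (columns).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(responses)
```

When most students answer most items correctly, the total scores vary little, and this tends to lower the estimate.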

Figure 1. The person-item location distribution of the exam. (N=62)

3. Discussion of the Item Characteristic Curve (ICC)

Let us now divide the sample items into three types: Type One represents items which seem to have had little discriminating power. Type Two contains those items which fit well to the Rasch model with an ideal probabilistic curve, and had moderate discriminating power. Type Three denotes items which had rather strong discriminating power.

Figure 2. Type One items (Items 2, 13, 17, 20, 21, 11, 20, 27, and 30)

Type One represents items that seemed to have little discriminating power. Items 2, 13, 17, 20, 21, 11, 20, 27, and 30 fell within this category, which is illustrated by the characteristic curve in Figure 2, with dots marking the locations of the student groups. The three dots in each figure represent groups of 24, 24, and 14 students, which were clustered into different ability groups by the RUMM 2020 software program.
In Figure 2, the observed pattern (the dots) was too flat relative to the expected curve (the smooth line) across the operating range of the variable (the horizontal axis) to discriminate among respondents in the three groups, which were formed on the basis of their total scores on the whole test.
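The comparison between the dots and the smooth line can be sketched as follows. The code assumes the standard dichotomous Rasch model; the group locations, responses, and function names are hypothetical and are not taken from the RUMM 2020 output.

```python
import math


def rasch_probability(ability, difficulty):
    """Expected probability of a correct answer (locations in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def observed_vs_expected(class_intervals, difficulty):
    """class_intervals: list of (mean_ability, [0/1 responses to one item])."""
    rows = []
    for mean_ability, scores in class_intervals:
        observed = sum(scores) / len(scores)  # a dot in the figure
        expected = rasch_probability(mean_ability, difficulty)  # the smooth line
        rows.append((mean_ability, observed, expected))
    return rows


# Hypothetical item with difficulty 0.0 and three ability groups.
intervals = [(-1.0, [0, 1, 0, 0]), (0.0, [1, 0, 1, 0]), (1.0, [1, 1, 1, 0])]
rows = observed_vs_expected(intervals, difficulty=0.0)
```

Observed proportions that rise more slowly than the expected values across the groups suggest low discrimination, and proportions that rise more quickly suggest high discrimination.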

Figure 3. Type Two items (#4, 5, 15, and 23)

Type Two items fit the Rasch model well and were between Type One and Type Three in terms of discrimination power.

Figure 4. Type Three items (#8, 10, 16, 18, 19, 24, and 26)

Type Three items showed too much discriminating power. In other words, the pattern obtained across the three total score groups was steeper than the ideal curve.
Please note that the three ability groups in Figures 2-4 are different from the three groups of majors discussed in Section II. In Figures 2-4 the groups (represented by dots) are based not on academic majors but on test performance. The high-achieving group, consisting of 24 students, did best on the test overall. The low-achieving group, consisting of 14 students, had the poorest performance, and the middle group, consisting of 24 students, performed in between. The number of students representing each dot is arbitrary, and there is no specific rationale for using groups of unequal size. The groups at the top and the bottom could not be made equal simply because the scores were not evenly clustered. In classical test theory, by contrast, it is common to divide the students into equally sized groups.
The number of groups can be selected by the user. The program places the students in order of ability and then clusters them into the requested number of groups, drawing the boundaries between groups at ability scores. In other words, if many students share the same ability score, the three groups will differ in size.
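The effect of tied scores on group size can be illustrated with the sketch below. It is a guess at the general idea only, not the algorithm RUMM 2020 actually uses, and the scores and function name are hypothetical.

```python
def class_intervals(scores, n_groups=3):
    """Split ordered scores into n_groups without separating tied scores."""
    ordered = sorted(scores)
    n = len(ordered)
    groups, start = [], 0
    for g in range(1, n_groups + 1):
        # Aim for equal sizes first; the last group takes whatever remains.
        end = n if g == n_groups else round(g * n / n_groups)
        end = max(end, start)
        # Move the boundary up so students with the same score stay together.
        while 0 < end < n and ordered[end] == ordered[end - 1]:
            end += 1
        groups.append(ordered[start:end])
        start = end
    return groups


# Hypothetical total scores for nine students.
groups = class_intervals([1, 2, 2, 2, 3, 4, 5, 5, 6])
sizes = [len(group) for group in groups]
```

With these nine scores an equal split would give three groups of three, but the tie on a score of 2 pushes the first boundary up, so the groups end up unequal.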

4. Discussion of Differential Item Functioning

Note that the groupings in Figures 5 - 7 are by major – not by the total scores on the test as was done in Figures 2-4. This time let us check the ICC curves for each item from the viewpoint of the respondents' majors and focus on the Differential Item Functioning aspect. Again, we have three types of relationships between groups and items represented by Figures 5-7, but the criterion used for grouping items and test-takers has changed, so we will assign a letter for each type of relationship. (Because of the small sample size, these curves tend to be idiosyncratic.)
Type A items (those for which all three majors have similar response patterns) show little or no DIF. Figure 5 indicates the expected values for the representative item, #9.

Figure 5. A Type A item in which all three populations had similar response patterns.

Type B contained items for which Law students (in green) performed idiosyncratically as shown in Figure 6. We can detect that, for items 29 and 20, the Law students seem advantaged relative to the other two groups. We should examine the content of these items. Could we expect Law students to know this content better than students from other majors?

Figure 6. A Type B item in which Law majors performed differently.

Type C contained items on which Business majors (denoted by the red line) performed rather differently from the other majors. Business majors seemed to perform better on Items 8 and 14 than other students. We should examine the content of these items.

Figure 7. A Type C item in which Business majors performed differently.

In summary, this information on DIF across majors shows that, relatively speaking, the 8 Law students were often more idiosyncratic and unpredictable than the 33 Economics or 21 Business students. However, the inconsistent patterns were likely due to the small size of the sample. More data would be needed before decisions could be made about the usefulness of this test.
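A rough descriptive screen for DIF on one item can be sketched as follows. It averages each group's residuals (observed score minus the probability expected under the dichotomous Rasch model) and is not the significance test that a Rasch program would apply. All names and values are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Expected probability of a correct answer (locations in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def mean_residual_by_group(abilities, scores, groups, difficulty):
    """Average (observed - expected) per group for one item."""
    sums, counts = {}, {}
    for ability, score, group in zip(abilities, scores, groups):
        residual = score - rasch_probability(ability, difficulty)
        sums[group] = sums.get(group, 0.0) + residual
        counts[group] = counts.get(group, 0) + 1
    return {group: sums[group] / counts[group] for group in sums}


# Hypothetical item (difficulty 0.0) answered by four students of equal ability.
abilities = [0.0, 0.0, 0.0, 0.0]
scores = [1, 1, 0, 0]
majors = ["Law", "Law", "Economics", "Economics"]
residuals = mean_residual_by_group(abilities, scores, majors, difficulty=0.0)
```

A clearly positive mean for one group and a clearly negative mean for another, at the same ability level, is the pattern that would prompt a closer look at the item's content. With groups as small as those in this study, such means are very unstable.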
The Rasch model is a way to check validity and reliability and see how items function for different groups. Ideally they should all function in the same way, meaning that if the total score measures the same overall ability, all people with the same total score should have the same probability of getting any particular item correct.
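The property described above follows from the form of the dichotomous Rasch model, given here for reference. The symbols are standard notation and do not appear in the original analysis: \beta_n is the ability of person n, \delta_i is the difficulty of item i, and X_{ni} is the scored response (1 = correct, 0 = incorrect).

```latex
P(X_{ni} = 1 \mid \beta_n, \delta_i)
  = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)}
```

Because the probability depends only on the difference \beta_n - \delta_i, and the total score is a sufficient statistic for \beta_n, two people with the same total score receive the same ability estimate and therefore the same expected probability on every item, whatever their major.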
Targeting is used to check the distribution of test-taker ability compared with the distribution of item difficulty. In this case, the scale could include more difficult items that would measure very able students a bit better.
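A simple numerical summary of targeting is sketched below. The person and item locations are invented, and the two summary measures are illustrative choices rather than statistics reported by RUMM 2020.

```python
from statistics import mean


def targeting_summary(person_locations, item_locations):
    """Compare the person and item distributions on the logit scale."""
    gap = mean(person_locations) - mean(item_locations)
    hardest_item = max(item_locations)
    above = sum(1 for location in person_locations if location > hardest_item)
    return {"mean_gap": gap, "persons_above_hardest_item": above}


# Hypothetical locations in logits.
persons = [0.5, 1.0, 1.5, 2.5]
items = [-1.0, 0.0, 1.0]
summary = targeting_summary(persons, items)
```

A positive mean gap indicates that the test is, on average, easy for the group, and students located above the hardest item are measured with the least precision.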

Figure 8. The person-item location distribution of students' ability and item difficulty.

V. Conclusion

In conclusion, this paper provides a brief explanation of how the scores on a multiple-choice test for three groups of students can be analyzed through the Rasch Model.

The DIF procedure recommended here can reveal that groups of students perform differently on individual items. Teachers generally tend to regard test takers as a homogeneous group with respect to the expected range of the ability being tested and the appropriateness of the test content. However, it has been pointed out that test takers may perform differently depending on their majors rather than on the ability being measured. This is probably because each major group has a different background of courses and subjects provided in its own major classes. Test designers and users therefore need to think about the quality of the test items as well as their quantity. The emphasis can be shifted from assumptions based on total test scores to the appropriateness of individual items, and from the whole test population to the groups of test takers within it.

The possible educational value of the present research is its suggestion that teachers, as language testers, should take into account students' backgrounds, such as their majors, genders, ages, and length of study in the area of second language learning.
Future research should use a larger sample and a longer version of this test, which could provide more convincing results.

Acknowledgements:

This research was supported in part by Dr. Irene Styles and Dr. David Andrich of Murdoch University, Perth, Western Australia.

Bibliography

Andrich, D. & Styles, I. (2004). Report on the psychometric analysis of the early development instrument (EDI) using the Rasch model. Centre for Learning, Change and Development. School of Education, Murdoch University.

Andrich, D., Sheridan, B., & Luo, G. (2004). RUMM 2020: A Windows program for the Rasch unidimensional measurement model. RUMM Laboratory, Perth, Western Australia.

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Studies in language testing 7: Dictionary of language testing. (pp. 44-45). University of Cambridge Local Examinations Syndicate. Cambridge University Press, Cambridge UK.

Milanovic, M. (Ed.) (1998). Studies in language testing 6: Multilingual glossary of language testing terms. (p.142). University of Cambridge Local Examinations Syndicate. Cambridge University Press, Cambridge UK.
