Second Language Acquisition - Theory and Pedagogy: Proceedings of the 6th Annual JALT Pan-SIG Conference.
May 12 - 13, 2007. Sendai, Japan: Tohoku Bunka Gakuen University. (pp. 84 - 96)

A Rasch-based evaluation of the presence of item bias in a placement examination designed for an EFL reading program

by Christopher Weaver (Jissen Women's University)


Abstract

This paper is an empirical investigation examining the extent to which test takers' field of study at university influenced the difficulty of items on a placement examination designed to stream science majors into a compulsory EFL reading program. A Rasch-based differential item functioning (DIF) analysis found that almost 40% of the items on the placement examination favored either a group of 206 science majors or a group of 204 non-science majors. The biased items were evenly split between the two reading passages used in the placement examination with each reading passage favoring only one group of test takers. The paper concludes with a discussion concerning the implications of these findings for test developers, curriculum designers, and researchers interested in clarifying the interaction between students, L2 reading passages, and test items.

Keywords: Item bias, Rasch measurement theory, placement examination, reading comprehension

	Placement tests are now administered at many universities. Some institutions use commercially produced tests for this purpose, but since a university's placement test should be closely tied to its own curriculum, each institution should ideally develop its own placement test suited to the level of its students and to the university's goals. This paper reports on an examination of the reliability and validity of a placement test administered in the Faculty of Letters at Keio University (consisting of four subsections: grammar, vocabulary, reading, and a cloze test) and considers directions for future research.

Keywords: placement test, Rasch analysis, test validity, test reliability




". . . institutions interested in a placement examination tailored to the specific needs of their language program are better off committing to the challenging, yet possibly rewarding journey of developing their own test items."

A growing number of universities use commercially produced proficiency examinations to stream students into their language programs. Although these examinations provide an overall indication of test takers' knowledge of English, they are written with the mass market in mind. The validity of these examinations within the Japanese academic context has also come into question (e.g. Chapman, 2005). As a result, institutions interested in a placement examination tailored to the specific needs of their language program are better off committing to the challenging, yet possibly rewarding journey of developing their own test items.
This paper reports on the development of a placement examination designed to stream science students into compulsory English reading courses offered at a national university in Japan. Of special interest to this project was the creation of a test that assessed test takers' English reading skills within the general field of science. It was hypothesized that test takers who had entered the university to pursue a degree in science would find the placement examination easier than non-science majors would. In other words, the content of the placement examination should favor science majors. This hypothesis was based upon a collection of empirical studies (e.g. Brantmeier, 2003; Carrell, 1987) suggesting that reading is a content-specific activity in which a reader's level of content knowledge and cultural background are important moderating variables for first and second language reading comprehension.

Research Questions

Drawing upon this schema-driven account of second language reading (e.g. Carrell & Eisterhold, 1988), the present investigation aims to determine the extent to which test takers' field of study and their level of content knowledge influence the difficulty of a placement examination designed for science majors. As such, the following research questions guide this investigation:
  1. To what extent does the level of difficulty of the items on the placement examination differ according to the test takers' field of study (i.e. science majors versus non-science majors)?
  2. If any differences exist, where do these differences occur in terms of the reading passages used in the placement examination and the item types designed to assess test takers' literal comprehension, reorganization, and inference skills?
  3. To what extent do test takers from different fields of study differ according to their background knowledge about the reading passages, their reported level of difficulty reading the passages, and their perceived relevancy of the reading passages to their future studies at university?
The answers to these research questions provide vital information concerning the interaction between the test takers and the placement examination, which can inform future decisions concerning the design and the performance of the placement examination. In addition, this study contributes to the larger issue of determining the extent to which background knowledge influences test takers' comprehension of a given text.

[ p. 84 ]

Participants

This study involved 206 science majors and 204 non-science majors attending a national university near Tokyo. These first year Japanese students included 174 females and 236 males.

Materials

The placement examination involved two reading passages originating from the official website of the Nobel Prize (Nobel Foundation, 2005). The idea was to choose texts related to science which did not require specialized knowledge of a specific field of study. One of the reading passages provided a brief historical account of the scientific and entrepreneurial successes that Alfred Nobel achieved during his lifetime, which ultimately led to the creation of the Nobel prizes (see Appendix A). The other reading passage was a brief historical account of how the structure of DNA was discovered and the controversy surrounding which scientists were recognized for this achievement (see Appendix B). Table 1 reports the descriptive features of each reading passage along with estimates of each text's readability using the Flesch Reading Ease Formula (Flesch, 1948) and the Flesch-Kincaid Grade Level. The reading passages were also analyzed with Web Vocabprofile (Cobb, 2006; Heatley & Nation, 1994) to determine the percentage of words at the 1000-word level, the 2000-word level, the Academic Word List level, and the Off-list level (i.e. low frequency vocabulary).
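For readers who would like to reproduce readability estimates of this kind, the sketch below applies the published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas to a passage. The syllable counter is a rough vowel-group heuristic rather than the dictionary-based counting used by standard tools, so its output may differ slightly from the figures reported in Table 1.

```python
import re

def count_syllables(word):
    """Rough vowel-group heuristic; dictionary-based counters are more accurate."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    count = len(groups)
    if word.lower().endswith("e") and count > 1:
        count -= 1  # treat a final 'e' as silent
    return max(count, 1)

def readability(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    asl = len(words) / len(sentences)        # average sentence length
    asw = syllables / len(words)             # average syllables per word
    flesch = 206.835 - 1.015 * asl - 84.6 * asw
    fk_grade = 0.39 * asl + 11.8 * asw - 15.59
    return round(flesch, 2), round(fk_grade, 2)

# Example: readability("Alfred Nobel was born in Stockholm. He invented dynamite.")
```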


Table 1. The descriptive features of the reading passages used in this study and estimates of their text difficulty

                               Nobel Passage    DNA Passage
Number of words                    367              356
Number of sentences                 19               16
Average sentence length          19.32            22.25
Flesch Reading Ease              45.23            44.28
Flesch-Kincaid Grade Level       11.75            12.61
1000-word level                   0.77             0.76
2000-word level                   0.07             0.07
Academic Word List                0.07             0.05
Off-list words                    0.10             0.12

In terms of the readability estimates, the DNA passage was slightly more difficult than the Nobel passage. The amount of coverage provided by the different vocabulary lists was roughly the same. The Nobel passage does, however, have slightly higher coverage at the 1000-word level and with the Academic Word List, while the DNA passage has a slightly higher percentage of Off-list words.

[ p. 85 ]


Each reading passage had 21 true/false statements. These statements were designed to assess three types of reading comprehension identified by Day and Park (2005): literal comprehension, reorganization, and inference. The development of these items was an involved process requiring a number of revisions. First, a pool of items was written and then submitted to a team of reviewers, who categorized the items according to the type of reading comprehension they required. Items that received unanimous categorizations were added to an item bank. The other items were either revised or replaced and then submitted to another round of review. This process of item writing, review, and revision continued until each reading passage had seven true/false statements for each of the three types of reading comprehension. Four versions of the placement examination were then made to counterbalance the order of the reading passages and randomize the order of the items, in order to offset any fatigue or sequencing effects.
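The counterbalancing of passage order and the randomization of item order can be scripted. The sketch below is only an illustration of the general idea; the item identifiers and seed are hypothetical rather than taken from the actual examination.

```python
import random

# Hypothetical identifiers for the 21 true/false items attached to each passage.
NOBEL_ITEMS = [f"N{i:02d}" for i in range(1, 22)]
DNA_ITEMS = [f"D{i:02d}" for i in range(1, 22)]

def build_forms(seed=2007):
    """Cross two passage orders with two independent item randomizations,
    yielding four counterbalanced versions of the examination."""
    rng = random.Random(seed)
    forms = {}
    passage_orders = {"Nobel-first": (NOBEL_ITEMS, DNA_ITEMS),
                      "DNA-first": (DNA_ITEMS, NOBEL_ITEMS)}
    for order_name, passages in passage_orders.items():
        for version in ("1", "2"):
            sequence = []
            for items in passages:
                shuffled = list(items)
                rng.shuffle(shuffled)        # randomize items within each passage
                sequence.extend(shuffled)
            forms[f"{order_name} (version {version})"] = sequence
    return forms

# forms = build_forms(); each of the four keys maps to a 42-item sequence.
```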
A follow-up questionnaire (see Appendix C) was also developed. In addition to a few biographical questions, the questionnaire asked students to rate their level of background knowledge about the reading passages, the difficulty they had reading the passages (in terms of vocabulary, grammar, and stylistic properties), and the relevancy of the reading passages to their studies at university. The questionnaire was written in Japanese and took approximately five minutes to complete.

Procedure

The test takers were given the placement examination during their regularly scheduled English class. The four versions of the placement examination were randomly distributed to the test takers along with optical mark card response sheets. The test takers had 20 minutes to complete the placement examination. After finishing the examination, they completed the short follow-up questionnaire.

[ p. 86 ]

Analysis

"The CHIP scale is a useful way of conceptualizing the probability of test takers' success."

The Rasch model (Rasch, 1960/1980) was utilized to provide estimates of difficulty for the 42 items used in both reading passages. These estimates are placed on an equal-interval scale measured in logits. Since this unit of measurement is not widely familiar, the estimates were transformed into a more user-friendly scale known as CHIPs. The CHIP scale is a useful way of conceptualizing the probability of test takers' success (E. Smith, Jr., 2000). For example, if there is no difference between a test taker's level of ability and an item's level of difficulty, then that person has a 50 percent chance of answering the item correctly. For the purposes of this study, a 50 percent chance of test taker success was set at 50 CHIPs. Thus, for items with difficulty estimates exceeding 50 CHIPs, the average test taker had less than a 50 percent chance of answering correctly.

The opposite is true of items with difficulty estimates below 50 CHIPs. The CHIP scale also has the additional benefit that the probabilities of an average test taker correctly answering items of different difficulty fall into easy-to-remember multiples of 5 (Wright & Stone, 1979). For example, Table 2 shows that an average test taker has a 10 percent chance of answering an item with a difficulty level of 60 CHIPs; the brief sketch following Table 2 reproduces these values.


Table 2. The relationship between CHIP scores and an average test taker's chances of success

Item difficulty          60 CHIPs    55 CHIPs    50 CHIPs    45 CHIPs    40 CHIPs
Probability of success     .10         .25         .50         .75         .90
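The probabilities in Table 2 follow directly from the dichotomous Rasch model once logits have been rescaled to CHIPs. The sketch below assumes the conventional rescaling in which 5 CHIPs correspond to ln(3) logits (roughly 4.55 CHIPs per logit), with the scale centered at 50; under that assumption it reproduces the values in Table 2.

```python
import math

LOGITS_PER_CHIP = math.log(3) / 5   # 5 CHIPs span ln(3) logits

def chips(logit_measure, center=50.0):
    """Rescale a Rasch measure from logits onto the CHIP scale."""
    return center + logit_measure / LOGITS_PER_CHIP

def p_correct(person_chips, item_chips):
    """Probability of a correct answer on a dichotomous Rasch item."""
    gap_in_logits = (person_chips - item_chips) * LOGITS_PER_CHIP
    return 1.0 / (1.0 + math.exp(-gap_in_logits))

# An average test taker (50 CHIPs) facing items of decreasing difficulty:
for difficulty in (60, 55, 50, 45, 40):
    print(difficulty, round(p_correct(50, difficulty), 2))
# prints 0.1, 0.25, 0.5, 0.75, 0.9 -- the values in Table 2
```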


Calculating item bias using Rasch measurement theory

There are a number of Rasch-based approaches for detecting possible item bias (R. Smith, 2004). This study used the separate calibration t-test approach implemented in WINSTEPS (Linacre, 2006). This approach compares item difficulty estimates calibrated separately for different subpopulations of test takers and involves five sequential steps (Linacre & Wright, 1989); a brief computational sketch of the final comparison follows the list. These steps are:
  1. Running an initial Rasch analysis that includes all test takers (i.e. science and non-science majors). This analysis provides estimates of the test takers' English reading ability and the scale structure used to evaluate the test takers' responses. In the case of this placement examination, the scale structure was dichotomous indicating whether or not an answer was correct.
  2. Anchoring the estimates of test takers' reading ability and the scale structure. This process entails taking the estimates of test takers' reading ability and the scale structure from the first Rasch analysis, which included all of the test takers, and using these estimates in the subsequent Rasch analyses, which produce estimates of item difficulty for the different subgroups of test takers (i.e. science versus non-science majors). Anchoring the test takers' ability estimates and the scale structure to the values from the first Rasch analysis makes it possible to compare item difficulty estimates produced by separate Rasch analyses involving different groups of test takers.
  3. Running a second Rasch analysis using only the science majors' responses. Once again, the person ability estimates and the scale structure are anchored to the values produced by the initial Rasch analysis involving both science and non-science majors. The result of this analysis is the item difficulty estimates for the science majors.
  4. Conducting a third Rasch analysis using only the non-science majors' responses to produce the item difficulty estimates for this subgroup of test takers.
  5. Performing a series of pairwise t-tests involving the two sets of item difficulties for the science and the non-science majors. Since this procedure involves multiple comparisons across the items on the placement examination, the alpha level was divided by two (i.e. the Bonferroni adjustment) to reduce the chance of a Type I error.
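To make the final step concrete, the sketch below performs the comparison for a single item once the anchored item difficulty estimates and their standard errors have been obtained for each subgroup. It is only an approximation of the separate calibration t-test reported by WINSTEPS (which computes the degrees of freedom more precisely), and the example values are hypothetical.

```python
import math
from scipy import stats

def dif_contrast_test(d_science, se_science, d_non, se_non, df=200):
    """Compare one item's anchored difficulty estimates for two subgroups.

    d_*  : item difficulty estimate (in CHIPs or logits) for each subgroup
    se_* : standard error of that estimate
    df   : approximate degrees of freedom (on the order of the group sizes)
    """
    contrast = d_science - d_non                   # negative favors science majors
    se_contrast = math.sqrt(se_science ** 2 + se_non ** 2)
    t = contrast / se_contrast
    p = 2 * stats.t.sf(abs(t), df)                 # two-sided p-value
    return contrast, t, p

# Hypothetical item that is 5 CHIPs easier for science majors:
contrast, t, p = dif_contrast_test(d_science=45.0, se_science=1.2,
                                   d_non=50.0, se_non=1.2)
flagged = p < 0.05 / 2                             # adjusted alpha from step 5
```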

[ p. 87 ]

Results

Two findings from this study seem particularly noteworthy; they are summarized below.

Finding #1: The variability of item difficulty according to the test takers' field of study

The pairwise comparisons between the item difficulties for science and non-science majors found that 15 out of the 41 items on the placement examination (36.7%) were biased. Table 3 shows the extent of this bias from two complementary perspectives. The first perspective is the DIF contrast, which is the difference between the science and non-science majors' item difficulty estimates, measured in CHIPs. Negative differences mean that the science majors' item difficulty estimate was lower (easier) than the non-science majors' estimate; Table 3 shows that 7 items favored science majors in this way. Positive differences, on the other hand, mean that the science majors' item difficulty estimate was higher (harder) than the non-science majors' estimate; Table 3 shows that 8 items favored non-science majors.


Table 3. The differences in item difficulties for science and non-science majors in the placement exam

Text     Item type         DIF contrast    Probability contrast    Favoring
Nobel    Literal               -3                  0.16            Science
Nobel    Reorganization        -5                  0.17            Science
Nobel    Reorganization        -2                  0.13            Science
Nobel    Reorganization        -3                  0.14            Science
Nobel    Inference             -3                  0.19            Science
Nobel    Inference             -2                  0.13            Science
Nobel    Inference             -3                  0.11            Science
DNA      Reorganization         3                  0.13            Non-science
DNA      Reorganization         4                  0.15            Non-science
DNA      Reorganization         3                  0.14            Non-science
DNA      Inference              3                  0.15            Non-science
DNA      Inference              4                  0.16            Non-science
DNA      Inference              5                  0.20            Non-science
DNA      Inference              3                  0.11            Non-science
DNA      Inference              4                  0.19            Non-science

[ p. 88 ]


The second perspective used to report the differences between the two groups was the probability contrast: the difference in the probability that the average test taker from each group would answer a given item correctly. Once again, negative differences indicate items that favored science majors, whereas positive differences indicate items that favored non-science majors. Table 3 shows that the probability contrasts ranged from 0.11 to 0.20.
"One interesting pattern . . . is that the items favoring the science and the non-science majors were split between the two reading passages."

One interesting pattern arising from Table 3 is that the items favoring the science and the non-science majors were split between the two reading passages. In other words, the item difficulties for the Nobel reading passage were lower (easier) for science majors and the item difficulties for the DNA reading passage were lower (easier) for non-science majors.
In terms of item type, most of the significant differences found between science and non-science majors involved items designed to evaluate test takers' reorganization and inference reading skills.

Finding #2: The variability of test takers' perceptions of the reading passages according to their field of study

The test takers' responses to the follow-up survey were subjected to a MANOVA. The test takers' field of study (i.e. science versus non-science major) served as the between-groups factor. The test takers' level of background knowledge about the reading passages, the level of difficulty they had with the reading passages, and the relevancy of the reading passages to their studies at university were the dependent variables for this analysis.
A significant multivariate difference was found between the science and non-science majors' responses (according to Pillai's Trace, Wilks's Lambda, Hotelling's Trace, and Roy's Largest Root). Those interested in knowing more about the meaning of these statistics should refer to Tabachnick and Fidell (2001, pp. 342-348).
As a consequence, a univariate one-way ANOVA was performed for each dependent variable to determine where the significant differences between the two subgroups of test takers existed. Once again, a Bonferroni adjustment was used to reduce the chance of a Type I error across this series of follow-up analyses, resulting in an alpha level of 0.008. Table 4 shows that science majors reported significantly higher levels of background knowledge about the reading passages. Science majors also rated the reading passages as significantly more relevant to their future studies. In terms of the level of difficulty that the test takers had reading the passages, no significant differences were found between science and non-science majors.
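A minimal sketch of this two-stage analysis is given below, using statsmodels for the omnibus MANOVA and scipy for the follow-up ANOVAs. The data frame and column names are hypothetical stand-ins for the questionnaire data; the adjusted alpha of 0.05/6 ≈ 0.008 corresponds to the six dependent variables.

```python
import pandas as pd
from scipy import stats
from statsmodels.multivariate.manova import MANOVA

# 'responses' is assumed to hold one row per test taker, with a 'major'
# column ('science' or 'non-science') and six hypothetical rating columns.
DVS = ["nobel_knowledge", "nobel_difficulty", "nobel_relevancy",
       "dna_knowledge", "dna_difficulty", "dna_relevancy"]

def analyze(responses: pd.DataFrame):
    # Stage 1: omnibus MANOVA (Pillai, Wilks, Hotelling, and Roy statistics)
    formula = " + ".join(DVS) + " ~ major"
    print(MANOVA.from_formula(formula, data=responses).mv_test())

    # Stage 2: follow-up one-way ANOVAs with a Bonferroni-adjusted alpha
    alpha = 0.05 / len(DVS)                        # approximately 0.008
    science = responses[responses["major"] == "science"]
    non_science = responses[responses["major"] == "non-science"]
    for dv in DVS:
        f, p = stats.f_oneway(science[dv], non_science[dv])
        print(f"{dv}: F = {f:.2f}, p = {p:.4f}, significant = {p < alpha}")
```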

[ p. 89 ]


Table 4. The difference between the science and non-science majors' responses on the follow-up questionnaire

Factor examined     Major          M      SD     Significant?
Nobel Knowledge     science       4.04   0.96         *
                    non-science   3.73   0.96
Nobel Difficulty    science       7.00   1.71
                    non-science   6.73   1.79
Nobel Relevancy     science       4.84   1.43         *
                    non-science   4.38   1.30
DNA Knowledge       science       4.40   1.37         *
                    non-science   4.11   1.35
DNA Difficulty      science       6.61   2.19
                    non-science   6.17   2.29
DNA Relevancy       science       4.76   1.79         *
                    non-science   4.17   1.67

* significant at p < 0.008


Discussion

This investigation found that a noteworthy number of items on the placement examination favored either science or non-science majors. Contrary to expectations, these differences coincided with the different reading passages. Many of the test developers were quite surprised that 8 out of the 21 items for the DNA reading passage favored non-science majors. Initially, it was conjectured that the non-science majors had a greater level of background knowledge about this topic. The analyses of the follow-up questionnaire, however, did not support this hypothesis. Science majors reported significantly higher levels of background knowledge about both reading passages. There were also no significant differences between science and non-science majors' reported levels of difficulty in reading the different passages.
We can therefore say that the practice of selecting texts containing content from the general field of science does not necessarily result in a placement examination that favors science majors, even when they report a higher level of background knowledge. The science majors' significantly higher ratings of the relevancy of the reading passages to their future studies at university, however, do provide the test designers with some empirical support for the face validity of the placement examination. This finding thus lends some support to the future use of the DNA reading passage.
Another important finding arising from this investigation is that items requiring reorganization and inferences for reading comprehension produced the most consistent and the largest differences between science and non-science majors. This finding suggests that there is a great deal of variability amongst test takers concerning their ability to combine information from different parts of a reading passage and draw inferences from what they have read. Identifying the strengths and weaknesses of test takers is not only essential to the process of streaming students into the appropriate classes, but it can also inform curriculum decisions concerning what needs to be taught.

[ p. 90 ]


Finally, the use of CHIP scores and probability contrasts provides the designers of the placement examination with a meaningful frame of reference for evaluating item performance. Graphical output such as Figure 1 can also be extremely useful. In order to understand the full utility of this output, a few lines explaining Figure 1 are in order. The shaded bands in Figure 1 indicate the probable levels of success that an average test taker has at the different levels of item difficulty.


Figure 1. A graphical representation of the probability of the average science and non-science major successfully answering the different items on this placement examination

For example, the average science major (indicated in black) has a 53 percent chance of correctly answering item 1. In comparison, the average non-science major (indicated in pink) has a 50 percent chance. The close proximity of these probabilities indicates that item 1 does not favor one subgroup of test takers over the other. In sharp contrast, item 11 heavily favors science majors. Figure 1 shows that the average science major has a 90 percent chance of correctly answering this item, while the average non-science major has a 73 percent chance. The difference between these probabilities, which corresponds to a DIF contrast of -5 CHIPs, thus becomes meaningful when it is considered in conjunction with the probability bands that stretch across Figure 1. Test developers can then decide whether or not a probability contrast of 17 percent is too great when selecting items for the final version of the placement examination.
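The figures cited for item 11 can be checked against the CHIP-based success probabilities described earlier. The sketch below uses the same rescaling assumption (5 CHIPs per ln 3 logits) together with illustrative, assumed values for the average test taker's ability and the item's anchored difficulties; it shows how a difficulty difference of roughly 5 CHIPs yields a probability contrast in the neighborhood of the reported 17 percentage points.

```python
import math

LOGITS_PER_CHIP = math.log(3) / 5

def p_correct(person_chips, item_chips):
    gap = (person_chips - item_chips) * LOGITS_PER_CHIP
    return 1.0 / (1.0 + math.exp(-gap))

# Assumed, illustrative values (not the actual WINSTEPS estimates):
average_ability = 50.0
d_science = 40.5        # item 11 anchored difficulty for science majors
d_non_science = 45.5    # roughly 5 CHIPs harder for non-science majors

p_science = p_correct(average_ability, d_science)            # about 0.89
p_non_science = p_correct(average_ability, d_non_science)    # about 0.73
contrast = p_science - p_non_science                          # about 0.16
# The published 90% / 73% / 17-point figures come from unrounded estimates,
# so small discrepancies of this kind are expected.
```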

[ p. 91 ]


Conclusion

This paper demonstrates the wealth of information that arises from the development of a placement examination designed to meet the specific needs of a university. The present study has led to a greater understanding of the performance of different reading passages and their related test items. Although the passages were assumed to be roughly equivalent, the Nobel passage was found to be easier than the DNA passage. The Rasch-based DIF analysis not only pinpointed where the differences between the passages existed, but also revealed that a number of the test items favored different subgroups of test takers. These findings can in turn inform decisions concerning the further development of the placement examination and the reading curriculum as a whole.
It must be remembered that placement examinations require constant, close monitoring and adjustment whenever the need arises. This particular placement examination, for example, focuses exclusively upon three levels of reading comprehension. Other aspects of reading, such as vocabulary knowledge, reading fluency, or the appropriate use of different reading strategies, are also important factors to consider when streaming students into an EFL reading program. Increasing the number of reading passages and using a variety of different types of texts might not only provide a more comprehensive account of students' ability to read English, but also contribute to a greater understanding of the relationship between students' background knowledge and their level of reading comprehension.
Although this paper has focused on the extent to which items on a placement examination favored two different student groups, it also suggests how a Rasch-based DIF analysis can be used to address a number of other issues involving different subgroups of test takers as well as different types of L2 reading passages. Possible future research questions include:
  1. To what extent does the level of item difficulty vary according to the gender of the test taker?
  2. To what extent does the level of item difficulty vary across different faculties or departments at the university?
  3. To what extent does the level of item difficulty vary across different types of passages (e.g. scientific versus non-scientific)?

These types of questions will undoubtedly help clarify the important interaction that occurs between test items and test takers. They will also contribute to the larger endeavor of articulating what reading in a second language entails. The present investigation of a placement examination designed for an EFL reading program makes contributions to both of these areas of interest. Moreover, the findings of this study can help to ensure that future placement examinations are fair and sensitive measures of test takers' reading abilities.

References

Brantmeier, C. (2003). Does gender make a difference? Passage content and comprehension in second language reading. Reading in a Foreign Language, 15 (1), 1-27.

Carrell, P. (1987). Content and formal schemata in ESL reading. TESOL Quarterly, 21 (3), 462-481.

Carrell, P., & Eisterhold, J. (1988). Schema theory and ESL reading pedagogy. In P. Carrell, J. Devine & D. Eskey (Eds.), Interactive approaches to second language reading. New York: Cambridge University Press.

[ p. 92 ]


Chapman, M. (2005). TOEIC: Claim and counter-claim. The Language Teacher, 29 (12).

Cobb, T. (2006). Web Vocabprofile [Computer software]. Retrieved March 12, 2007, from http://www.lextutor.ca/vp/ [an adaptation of Heatley & Nation's (1994) Range].

Day, R., & Park, J. (2005). Developing reading comprehension questions. Reading in a Foreign Language, 17 (1), 60-73.

Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32, 221-233.

Heatley, A., & Nation, P. (1994). Range. Université du Québec à Montréal. [Computer program, available at http://www.lextutor.ca/range/].

Linacre, J. (2006). WINSTEPS Rasch measurement computer program (Version 3.60) [Computer software]. Chicago: Winsteps.com.

Linacre, J., & Wright, B. (1989). Mantel-Haenszel DIF and PROX are equivalent! Rasch Measurement Transactions, 3 (2), 52-53.

Nobel Foundation. (2005). The Ultimate Source for Nobel Prize Information. Retrieved March 16, 2005, from http://www.nobelprize.org/

Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests (Expanded ed.). Chicago: The University of Chicago Press.

Smith, E., Jr. (2000). Metric development and score reporting in Rasch measurement. Journal of Applied Measurement, 1 (3), 303-326.

Smith, R. (2004). Detecting item bias with the Rasch model. Journal of Applied Measurement, 5 (4), 430-449.

Tabachnick, B., & Fidell, L. (2001). Using multivariate statistics (4th ed.). Needham, MA: Allyn and Bacon.

Wright, B., & Stone, M. (1979). Best test design: Rasch measurement. Chicago: Mesa Press.



[ p. 93 ]