A Rasch-based analysis of an in-house English placement test

Second Language Acquisition - Theory and Pedagogy: Proceedings of the 6th Annual JALT Pan-SIG Conference.
May 12-13, 2007. Sendai, Japan: Tohoku Bunka Gakuen University. (pp. 97-109)

by Yuji Nakamura (Keio University)

Abstract

Placement testing is probably one of the most widespread uses of tests. Though some institutions still choose commercially produced proficiency tests for placement purposes, an institution's placement test should be closely linked with its curriculum. This paper argues that institutions should develop in-house tests to meet their own needs. The present research examines the reliability and validity of a placement test with four sub-components used by the Faculty of Letters at Keio University. The article concludes with a discussion of the research methodology.

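Keywords: placement testing, Rasch analysis, test validity, test reliability

Abstract (translated from the Japanese)

Placement tests have recently been administered at many universities. Some institutions use commercially available tests for this purpose, but because a university's placement test should be closely tied to that university's curriculum, each institution should ideally develop its own placement test suited to the level of its students and to its goals. This paper reports the results of an examination of the reliability and validity of a placement test administered at the Faculty of Letters, Keio University (consisting of four subsections: grammar, vocabulary, reading comprehension, and a cloze test), and considers directions for future research.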

The Faculty of Letters at Keio University primarily aims to improve students' reading ability to further enhance their learning. For that purpose a placement test is needed to accurately place the students into their appropriate proficiency levels to optimize their learning experiences, and to provide multi-faceted English communicative instruction. The purpose of this placement test is to measure the incoming 2006 students' English reading ability and English proficiency to provide streamed instruction.

The goals of this project are threefold:

to offer four levels of EFL classes to students according to their English reading ability, as ascertained by the method below,
to offer classes for those who, according to the method below, need remedial instruction, and
to offer classes for those who have already reached the required level and desire further study.

Reading ability is thought to consist of grammar knowledge, vocabulary knowledge, the ability to comprehend long passages with full context (in other words, text that has no deleted words or blanks intended for other questions), and the ability to understand passages without sufficient context or information. In other words, a test of reading ability should be composed of a grammar test, a vocabulary test, a reading comprehension test, and a cloze test.

Although commercial tests such as the TOEFL-ITP, TOEIC-IP, G-TELP, Step-EIKEN and CASEC exist, the faculty members agreed that the content, level and purpose of those tests were not appropriate for placing students in the Faculty of Letters. Furthermore, the admissions test could not be used for any purpose other than entrance selection, since admission tests are basically used for screening only. Because a variety of admission procedures are now in use (admissions office tests, interview tests, the center-test, and the high-school recommendation test), the so-called admission tests do not function well for placement purposes. In addition, given widespread concern about privacy and security, which extends even to admission test scores, it seems extremely difficult to use admission test results for streaming purposes.

Tests can be valid or not depending on whether they agree with the purpose of the test users. The purpose of the aforementioned tests does not seem to fit the purpose of the Faculty of Letters of Keio University. For example, we are not solely intent on measuring students who will study overseas, or on assessing the skills of students who will engage in business communication after graduation. The purpose of this project is to encourage students to develop their English reading ability, which is indispensable for study in their major areas. Almost all the students in the Faculty of Letters are required to read materials in English whether their major is English or not. For these reasons, we decided to develop our own placement test.

Commenting on this point, Westrick suggests:

More studies on the use of commercially-produced tests and in-house tests for placement purposes at other Japanese colleges and universities are needed. Creating an effective placement test involves developing test items related to a true curriculum with clear goals and objectives, piloting the tests items, analyzing the data, and revising the tests to ensure that the scores are reliable and sound placement decisions can be made. This requires hard work, but it must be done if fair and defensible placement decisions are to be made. (2005, p.90)

Furthermore, a number of other scholars take a similar stance on placement testing. Brown (1996, p. 12) says that a placement test must be specifically related to a given program. Hughes (2003, p. 16) claims that placement tests should be developed by the users themselves so that they specifically meet their needs. Fulcher likewise argues:

The goal of placement testing is to reduce to an absolute minimum the number of students who may face problems or even fail their academic degrees because of poor language ability or study skills. (1997, p. 113)

Purpose of this study

The purpose of the present study is to examine the pilot version of a placement test and decide whether the real version of the test should have the same format.

McNamara (2000, p. 83) states, "There are three basic critical dimensions of tests – validity, reliability, and feasibility, whose demands need to be balanced." McNamara (2000, pp. 50-51) also mentions three aspects that can threaten test validity: (1) test content, (2) test method and (3) test construct.

Taking these facets of a test into consideration, this study seeks to examine whether the pilot version of this particular placement test has enough validity, reliability and practicality to merit further implementation. This overall question gives rise to the following hypotheses:

Hypothesis 1: The test does not have enough validity.

From a Rasch perspective, validity denotes the degree to which observed research results fit a given model. Construct validity in the Rasch model is investigated in five steps: (1) chi-square examination, (2) FitResidual examination, (3) location examination, (4) item characteristic curve examination, and (5) targeting information.

Among these, item analysis using the item characteristic curve (ICC) is the main focus of the present research because it can contribute most to improving the revised test. The ICC shows how closely the observed responses to an item follow the curve predicted by the model. In other words, it provides information about construct validity. It also indicates whether or not the item discriminates well among students. Along with the ICC, distracter information will be discussed as well.
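As a rough illustration of what the ICC represents, the theoretical curve for a dichotomous item under the Rasch model is the logistic function of the difference between person ability and item location, both expressed in logits. The sketch below is not the RUMM 2020 implementation; the function names `rasch_probability` and `icc_points` are hypothetical and chosen only for this example.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def icc_points(difficulty: float, abilities):
    """Return (ability, expected probability) pairs tracing the theoretical ICC."""
    return [(theta, rasch_probability(theta, difficulty)) for theta in abilities]


# Trace the theoretical curve for a hypothetical item located at 0 logits.
for theta, p in icc_points(0.0, [-3, -2, -1, 0, 1, 2, 3]):
    print(f"ability {theta:+d} logits: expected proportion correct {p:.3f}")
```

An item whose observed proportions rise more steeply than this curve would be described as overdiscriminating, and one whose observed proportions are flatter than the curve as underdiscriminating.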

Also, the content validity of this test will be discussed in a non-statistical way. A test is said to have content validity if the questions reflect the course content or syllabus. A test is said to have face validity if the test stakeholders think that the test is measuring what it should. In the discussion of content validity, the test construct and the test method are additionally discussed. The test construct will be discussed in terms of the difficulty order of the subsections. The test method discussion will focus on how the test was planned, administered and scored. Face validity will be investigated through the results of an examinee questionnaire.

Hypothesis 2: The test does not have acceptable reliability.

Reliability is investigated by means of the person separation index, which is analogous to Cronbach's alpha. A widely accepted benchmark for the person separation index is 0.7 or more. This pilot placement test was developed to examine these hypotheses in relation to the main research question.
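For readers who wish to see how these two indices relate, the following minimal sketch computes Cronbach's alpha from a matrix of item scores and a person separation index from person estimates and their standard errors. It assumes the common definition of the index as the proportion of observed variance in person estimates that is not error variance; the exact computation in RUMM 2020 may differ in detail, and the function names and the toy data are hypothetical.

```python
from statistics import mean, pvariance


def cronbach_alpha(scores):
    """scores: one list of item scores per person (all lists the same length)."""
    k = len(scores[0])
    item_variances = [pvariance([person[i] for person in scores]) for i in range(k)]
    total_variance = pvariance([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def person_separation_index(estimates, standard_errors):
    """Share of the observed variance in person estimates that is not error."""
    observed_variance = pvariance(estimates)
    error_variance = mean([se ** 2 for se in standard_errors])
    return (observed_variance - error_variance) / observed_variance


# Toy data (hypothetical): four persons answering two perfectly consistent items.
toy_scores = [[1, 1], [0, 0], [1, 1], [0, 0]]
print(cronbach_alpha(toy_scores))

# Toy person estimates in logits with a constant standard error (hypothetical).
print(person_separation_index([-1.0, 0.0, 1.0], [0.5, 0.5, 0.5]))
```

Both indices range up to 1, and a value at or above the 0.7 benchmark would be read in the same way for either.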

Method

Subjects

The subjects were 809 first-year university students in the Faculty of Letters of Keio University.

Materials/instruments

A 50-item multiple-choice (MC) placement test with four components was used in this study. The test contained 15 grammar MC questions, 10 vocabulary MC questions, three long reading passages with 5 MC questions each, and 10 cloze MC questions. Examinees had 60 minutes to complete the test, which was scored by optical readers. The reading section consisted of one beginning-level, one intermediate-level, and one advanced-level passage of about 400-500 words in length. Difficulty was rated impressionistically by teachers in terms of content, topic, and vocabulary level.

Procedure

Test construction

Nakamura (1998, p. 260) proposed four points to consider in assessing reading ability: (1) the nature of reading, (2) the theoretical or linguistic underpinnings of reading, (3) the test format of reading, and (4) classroom teachers' ideas based on their teaching experiences. The construct of "reading ability" for this test was established mainly from these points plus the following aspects specific to the Faculty of Letters:

the teachers' teaching experience with the reading sections of other existing tests
linguistic theories (Alderson, 2000; Grabe, 2000; Hughes, 2003)
the needs of the Mita campus, where students are required to read the major books and references for their study areas (in other words, the reading ability required at the Mita campus)
the textbooks that are actually used in students' study areas

The grammar items were chosen to cover almost all of the grammar points that students are supposed to have mastered at the high school level, as reflected in the textbooks authorized by the Ministry of Education, which are available at bookstores. Since we did not pretest items in order to determine their difficulty empirically, we relied on theory to create items and sections at different ability levels. For example, the vocabulary items were based on word frequency counts, using commercially available English-Japanese dictionaries as a benchmark, while the grammar items were based on developmental sequences and on an analysis of the written structures in the Ministry-authorized textbooks.

The reading passages were selected from three disciplines (humanities, social sciences and natural sciences), and appropriate vocabulary levels were taken into consideration. The passages were analyzed using the Flesch Reading Ease formula (a readability formula developed for L1 readers) together with the judgments of experienced teachers.
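For reference, the Flesch Reading Ease score is computed from average sentence length and average syllables per word, with higher scores indicating easier text. The sketch below applies the standard formula; the syllable counter is a crude vowel-group heuristic introduced only for illustration, so its scores will differ somewhat from those of dedicated readability software, and the function names are hypothetical.

```python
import re


def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def score_text(text: str) -> float:
    """Score a passage using naive sentence, word and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)


print(round(score_text("The cat sat. The dog ran."), 2))
```

Candidate passages could be scored in this way and then ordered from easier to harder before teachers make their final judgments.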

Test analysis

The test data were analyzed using the RUMM 2020 statistical program (Andrich, Sheridan & Luo, 2004). The chi-square values were examined to determine whether there was a large gap between neighboring items. The acceptable range for FitResidual values was set at -3 to +3. The location order was examined to establish the difficulty order of the items. The item characteristic curves were examined to check the discriminating power of each item. The benchmark for the person separation index, as the measure of test reliability, was set at 0.7 or over.

Results and Discussion

In the explanation below, four abbreviations will be used: G stands for grammar, V for vocabulary, R for reading, and C for cloze.

Chi-Square order

The chi-square statistic is used to check whether the responses fit the model. In the Chi-square column, smaller values indicate better fit, while in the Probability column larger values indicate better fit. We also examine whether there is a large gap between neighboring items in the chi-square order.

Table 1. Chi-square order of the 2006 pilot placement test items

This table shows that three items (C47, R29, and G11) need to be examined because their chi-square values are separated from those of the neighboring items by a gap of about 8 points.

FitResidual order

The FitResidual is used to check whether or not an item discriminates as the model expects. The acceptable range is usually from -3 to +3. A negative residual indicates overdiscrimination (overfitting), while a positive residual indicates underdiscrimination (underfitting).

Table 2. FitResidual order of the 2006 pilot placement test items

When the acceptable range (-3 to +3) is applied to the three items identified in the chi-square investigation, R29 is regarded as overfitting (overdiscriminating) and G11 as underfitting (underdiscriminating). Based on this chi-square and FitResidual information, three items (R29, G11, and C47) appear to be problematic. They need to be investigated further in terms of location order.
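The two screening rules can be sketched in a few lines. The rows below are a subset of the values reported in Table 3 (the items with the largest chi-square values, plus the items whose residuals fall outside the range); the helper names `flag_misfit` and `chi_square_gaps` are hypothetical. Applied on its own, the -3 to +3 rule also picks out G03, G04, V16, V19 and C44, whereas the discussion here is limited to the items that stand out in the chi-square order as well.

```python
# (item, fit residual, chi-square), copied from a subset of the rows in Table 3.
ROWS = [
    ("G11", 5.942, 61.082), ("R29", -3.077, 45.955), ("C47", 2.698, 45.466),
    ("G13", -0.772, 37.366), ("V24", 2.409, 35.324), ("G03", -3.647, 32.560),
    ("C49", 2.342, 30.129), ("C50", 1.497, 29.045), ("G07", -2.294, 27.909),
    ("V16", 4.733, 21.973), ("C44", 3.518, 19.936), ("G04", 3.148, 18.594),
    ("V19", 4.998, 14.843),
]


def flag_misfit(rows, limit=3.0):
    """Label items whose fit residual lies outside the range -limit..+limit."""
    return [(item, "overfit" if residual < 0 else "underfit")
            for item, residual, _ in rows if abs(residual) > limit]


def chi_square_gaps(rows, min_gap=8.0):
    """Find neighbours in chi-square order separated by at least min_gap."""
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    return [(a[0], b[0], round(a[2] - b[2], 3))
            for a, b in zip(ordered, ordered[1:]) if a[2] - b[2] >= min_gap]


print(flag_misfit(ROWS))
print(chi_square_gaps(ROWS))
```

With a threshold of 8, the gaps found lie between G11 and R29 and between C47 and G13, which separates G11, R29 and C47 from the remaining items.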

Location Order

The FitResidual indicates discriminating power (acceptable range -3 to +3), with negative values indicating overdiscrimination and positive values underdiscrimination. The location order, on the other hand, indicates the difficulty of the items. The usual range is -3 to +3 logits.

Table 3. Location order of the pilot placement test items

-----------------------------------------------------------------------------
Seq  Item   Type   Location   SE    Residual   DF      ChiSq  DF    Prob     
-----------------------------------------------------------------------------
28   R28     MC     -3.234   0.253   -0.868  781.00   10.653   9  0.300268
10   G10     MC     -2.098   0.153   -1.534  781.00   17.766   9  0.037983
14   G14     MC     -1.953   0.145   -1.503  781.00   15.122   9  0.087630
29   R29     MC     -1.702   0.131   -3.077  781.97   45.955   9  0.000001
27   R27     MC     -1.606   0.127   -1.153  781.97    8.169   9  0.517196
13   G13     MC     -1.311   0.114   -0.772  780.02   37.366   9  0.000023
34   R34     MC     -1.306   0.114   -1.958  781.00   19.300   9  0.022759
39   R39     MC     -1.207   0.110   -1.976  778.06   23.730   9  0.004751
2    G02     MC     -1.191   0.110   -0.572  781.97    4.759   9  0.854781
26   R26     MC     -1.169   0.109   -0.631  781.97   13.323   9  0.148513
46   C46     MC     -1.016   0.104   -2.359  775.12   23.194   9  0.005777
7    G07     MC     -0.997   0.103   -2.294  781.00   27.909   9  0.000988
40   R40     MC     -0.820   0.098   -0.874  778.06    8.037   9  0.530397
1    G01     MC     -0.680   0.094   -1.467  781.97   15.424   9  0.079919
45   C45     MC     -0.526   0.091   -0.768  777.08   14.598   9  0.102588
8    G08     MC     -0.448   0.089    0.661  781.00    6.283   9  0.711335
24   V24     MC     -0.393   0.088    2.409  781.97   35.324   9  0.000052
3    G03     MC     -0.227   0.085   -3.647  780.02   32.560   9  0.000159
37   R37     MC     -0.103   0.083   -1.914  779.04   18.754   9  0.027367
6    G06     MC     -0.064   0.082   -2.258  781.00   18.563   9  0.029180
36   R36     MC     -0.034   0.082    0.430  774.14   12.346   9  0.194485
15   G15     MC     -0.024   0.081    0.012  780.02   14.703   9  0.099426
12   G12     MC      0.057   0.080    1.072  780.02   13.304   9  0.149347
50   C50     MC      0.103   0.081    1.497  764.36   29.045   9  0.000637
48   C48     MC      0.172   0.080    0.518  771.21    9.040   9  0.433593
38   R38     MC      0.209   0.079    0.215  779.04    7.787   9  0.555765
31   R31     MC      0.218   0.079    1.230  780.02    8.772   9  0.458594
23   V23     MC      0.284   0.078    0.658  781.00    6.175   9  0.722263
5    G05     MC      0.341   0.077    0.211  779.04   11.815   9  0.223935
33   R33     MC      0.347   0.077   -0.186  781.97    9.051   9  0.432568
18   V18     MC      0.405   0.077    1.989  778.06    9.579   9  0.385588
41   C41     MC      0.484   0.076    0.962  780.02   13.116   9  0.157419
21   V21     MC      0.527   0.076   -1.753  781.00   17.097   9  0.047220
32   R32     MC      0.533   0.076   -0.613  781.97   10.796   9  0.289949
9    G09     MC      0.652   0.075   -0.882  780.02   15.579   9  0.076204
4    G04     MC      0.672   0.075    3.148  781.00   18.594   9  0.028875
19   V19     MC      0.723   0.075    4.998  780.02   14.843   9  0.095339
25   V25     MC      0.728   0.075    1.496  781.97    4.276   9  0.892325
22   V22     MC      0.748   0.075    0.978  776.10   12.365   9  0.193523
43   C43     MC      0.802   0.075    0.976  780.02    7.183   9  0.618101
42   C42     MC      0.902   0.075    1.848  778.06   13.526   9  0.140218
16   V16     MC      0.975   0.075    4.733  780.02   21.973   9  0.008967
17   V17     MC      0.978   0.075    2.545  780.02   15.440   9  0.079538
44   C44     MC      1.116   0.076    3.518  771.21   19.936   9  0.018314
11   G11     MC      1.255   0.076    5.942  780.02   61.082   9  0.000000
30   R30     MC      1.353   0.076    0.042  781.97    6.548   9  0.684044
20   V20     MC      1.483   0.077   -0.294  781.00    8.577   9  0.477171
35   R35     MC      1.499   0.077   -0.482  780.02   14.418   9  0.108223
49   C49     MC      1.975   0.084    2.342  764.36   30.129   9  0.000417
47   C47     MC      2.570   0.095    2.698  774.14   45.466   9  0.000001
-----------------------------------------------------------------------------

R29 is the fourth easiest item in the order, while G11 is the sixth most difficult. C47, which was identified as problematic in terms of its chi-square order, is the most difficult item. The location order also shows that Reading and Grammar items tend to fall on the easier side of the continuum, while Vocabulary and Cloze items are relatively difficult. In other words, the location order reveals the difficulty order of the test's subsections.
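The tendency described above can be checked by averaging the locations in Table 3 within each section. The sketch below does only that; the dictionary layout and variable names are choices made for this example, and the values are transcribed from the Location column of Table 3 in item-number order.

```python
from statistics import mean

# Item locations (logits) from Table 3, grouped by section in item-number order.
LOCATIONS = {
    "G": [-0.680, -1.191, -0.227, 0.672, 0.341, -0.064, -0.997, -0.448,
          0.652, -2.098, 1.255, 0.057, -1.311, -1.953, -0.024],   # G01-G15
    "V": [0.975, 0.978, 0.405, 0.723, 1.483, 0.527, 0.748, 0.284,
          -0.393, 0.728],                                          # V16-V25
    "R": [-1.169, -1.606, -3.234, -1.702, 1.353, 0.218, 0.533, 0.347,
          -1.306, 1.499, -0.034, -0.103, 0.209, -1.207, -0.820],  # R26-R40
    "C": [0.484, 0.902, 0.802, 1.116, -0.526, -1.016, 2.570, 0.172,
          1.975, 0.103],                                           # C41-C50
}

section_means = {section: mean(values) for section, values in LOCATIONS.items()}

# Print the sections from easiest (lowest mean location) to hardest.
for section, value in sorted(section_means.items(), key=lambda kv: kv[1]):
    print(f"{section}: mean location {value:+.3f} logits")
```

The Grammar and Reading means come out below zero and the Vocabulary and Cloze means above zero, in line with the ordering described in the text.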

Item characteristic curve (ICC)

ICC is used to show in detail the degree of agreement between the observed proportions and the theoretical curve.

Figure 1. An examination of item G11 and its item characteristic curve (ICC) in the 2006 pilot placement test

This ICC of G11 shows us that the less able students performed better than anticipated. It also indicates that the more able students performed more poorly than anticipated. This further shows that the item did not discriminate well between lower-level and intermediate-level students. A likely reason is that the item was a little too difficult (1.255 logits above the mean). It would probably function better to differentiate the more able students at the top end. From the lower end to the mid group, there is no discrimination, and even negative discrimination. From the mid group to the top it has some discriminating power, but still not as much as anticipated. In short, this item did not separate the three groups as clearly as wished.

Figure 2. An examination of item R29 and its item characteristic curve (ICC) in the 2006 pilot placement test

This ICC of R29 shows that the item is problematic because it is overdiscriminating. We can tell the difference between the lower-level and the intermediate-level students. However, it does not discriminate among the top-level students, who seem to have some advantage in, or bias toward, the topic. The less able students do not fit the model. The lower end is overdiscriminating, and the lower group is performing more poorly than anticipated.

Figure 3. An examination of item C47 and its item characteristic curve (ICC) in the pilot placement test

This ICC of C47 indicates that the item has no discriminating power. All the groups answer the item correctly at a rate below the guessing level. This is probably the reason it was identified as problematic in terms of its chi-square order.

So far, only three items have been identified as problematic. They account for just three of the 50 items, or 6% of the whole. However, it would be premature to conclude that this proportion is negligible.

Distracter curve information

Grammar Section

Now let us examine Figure 4, which shows the distracter curves. These curves indicate how attractive each of the answer options (the key and the distracters) is to test takers at different ability levels.

Figure 4. A description of a key answer and the three distracters of item G11 in the 2006 pilot placement test

Figure 4 looks strange because, up to an ability level of 1.2, the key and the distracters functioned in a confusing way. Above the 1.2 ability level, the key answer functioned properly. Between 0.7 and 1.0, the students preferred Option 2 to the key answer. In this case, it was the lower- and upper-ability students who got the item correct.

Reading section

Let us take a look at how the distracters are behaving in the following item.

Figure 5. A description of a key answer and the three distracters of item R29 in the 2006 pilot placement test

For R29, the key answer functioned well: each distracter was chosen less often than the key answer. The distracters of this item appear to be functioning reasonably well.

Cloze Section

Let us look at how the distracters are functioning in the following item.

Figure 6. A description of a key answer and the three distracters of item C47 in the 2006 pilot placement test

Item C47 is strange and difficult. Neither the three distracters nor the key answer functions properly at any ability level. Even the key is chosen at a rate below the guessing level. It seems that there are two correct answers, with Option 3 the most popular. Students at all levels appear to have misunderstood the concept, and the key answer is not discriminating.

Information targeting

Information targeting shows the relative position between persons (person ability) and items (item difficulty). The following graph shows us that we need some more difficult items to match the more able students in the future version.

Figure 7. Relative positions between persons (person ability) and items (item difficulty) in the 2006 placement test through an information targeting graph

This figure suggests that as a whole the test was very good at measuring students' English proficiency. For future improvement, more difficult items are needed to match the more able students at the top of this continuum.
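One way to make the targeting argument concrete without person data is to compute the test information function from the item locations alone. Under the Rasch model each item contributes P(1 - P) to the information at a given ability, and the standard error of measurement is the reciprocal of the square root of the total. The sketch below uses the 50 locations reported in Table 3; it is an illustration under those assumptions, not a reproduction of the RUMM targeting graph in Figure 7, and the function names are hypothetical.

```python
import math

# The 50 item locations (logits) from Table 3, in location order.
ITEM_LOCATIONS = [
    -3.234, -2.098, -1.953, -1.702, -1.606, -1.311, -1.306, -1.207, -1.191, -1.169,
    -1.016, -0.997, -0.820, -0.680, -0.526, -0.448, -0.393, -0.227, -0.103, -0.064,
    -0.034, -0.024, 0.057, 0.103, 0.172, 0.209, 0.218, 0.284, 0.341, 0.347,
    0.405, 0.484, 0.527, 0.533, 0.652, 0.672, 0.723, 0.728, 0.748, 0.802,
    0.902, 0.975, 0.978, 1.116, 1.255, 1.353, 1.483, 1.499, 1.975, 2.570,
]


def item_information(theta: float, delta: float) -> float:
    """Information of one Rasch item at ability theta: P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(theta - delta)))
    return p * (1.0 - p)


def total_information(theta: float, locations) -> float:
    """Sum of the item information values at ability theta."""
    return sum(item_information(theta, delta) for delta in locations)


def standard_error(theta: float, locations) -> float:
    """Approximate standard error of the ability estimate at theta."""
    return 1.0 / math.sqrt(total_information(theta, locations))


for theta in (-3, -2, -1, 0, 1, 2, 3):
    info = total_information(theta, ITEM_LOCATIONS)
    se = standard_error(theta, ITEM_LOCATIONS)
    print(f"ability {theta:+d}: information {info:.2f}, standard error {se:.2f}")
```

Information is highest near the centre of the item distribution and falls away at high ability levels, which is consistent with the suggestion that more difficult items are needed for the most able students.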

Examination of reliability

Reliability was investigated by means of the person separation index, which is akin to Cronbach's alpha. The benchmark for acceptability is a value over 0.7. The person separation index of this placement test was 0.78, which suggests that the items in the test were internally consistent.

Summary of the results and discussion

This study explored whether the pilot version of this placement test had enough validity and reliability to proceed to the real test. The null hypotheses pertaining to the primary research question were rejected.

The ability level of the top group was higher than the difficulty level of the items in most cases. On the other hand, the ability of the bottom group, who need remedial instruction, was below the difficulty level of the reading section and the grammar section in most cases.

Hypothesis 1, "The test does not have enough validity," was not supported. The discovery of three problematic items was a minor defect from the viewpoint of the whole test. In other words, 94% of the test items fit the model, which technically verifies the construct validity of the test. However, validity should be examined in more detailed ways, including concurrent validity and/or factor analysis methods.

Hypothesis 2, "The test does not have acceptable reliability," was not supported. Reliability was investigated by means of the person separation index, which at 0.78 exceeded the acceptable benchmark of 0.7. Accordingly, the alternate hypothesis, "The test is reliable," was accepted.

The 2007 version of the test should explore the issue of face validity and practicality more systematically. Since the 2006 test was investigated mainly in terms of reliability and construct validity, further research is needed to fully corroborate this test.

Conclusions and implications

The research question for this study was partially supported by the examination of the three presuppositions. Also, the information obtained from the person-item relative position helped us divide the students into appropriate groups. However, there were not enough difficult questions at the high end of the spectrum to create three levels. That is not a problem of test design, but rather of item content. Future research should address this problem.

Considering McNamara's (2000, p. 83) statement "The right balance of three basic critical dimensions of tests – validity, reliability and practicality – will depend on the test context and test purpose," the present placement test should be regarded as acceptable judging from the statistical analyses and the test context as well as the test purpose. For future improvement, the predictive validity should be investigated as well as the concurrent validity, the factor analysis and the multi-trait multi-method (MTMM) analysis for the test validation. Future studies should also explore the issue of face validity and practicality more systematically.

Acknowledgement
The present author is grateful to Dr. David Andrich and Dr. Irene Styles for their invaluable comments and professional help.

References and Bibliography

Alderson, J.C. (2000). Assessing reading. New York: Cambridge University Press.

Andrich, D., Sheridan, B. & Luo, G. (2004). RUMM 2020: Rasch Unidimensional Measurement Models [Computer software]. Perth, Western Australia: RUMM Laboratory.

Bachman, L.F. (1999). Fundamental considerations in language testing. Oxford: Oxford University Press.

Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment. New Edition. New York: McGraw-Hill.

Fulcher, G. (1997). An English language placement test: issues in reliability and validity. Language Testing 14, 2, 113-138.

Grabe, W. (2000). Reading research and its implications for reading assessment. In A. Kunnan (Ed.), Fairness and validation in language assessment (pp. 226-62). Cambridge: Cambridge University Press.

Hughes, A. (2003). Testing for Language Teachers. Cambridge: Cambridge University Press.

Linacre, M. (2004). WINSTEPS Rasch Measurement computer program (Version 3.51). Chicago: Winsteps.com.

McNamara, T. (2000). Language Testing. Oxford: Oxford University Press.

Nakamura, Y. (1998). Components of Reading Ability. Educational Studies, 40. 259-281. International Christian University.

Westrick, P. (2005). Score Reliability and Placement Testing. JALT Journal 27, 1, 71-92.
