by Yuji Nakamura (Keio University)

Keywords: placement testing, Rasch analysis, test validity, test reliability

The goals of this project are threefold:
Reading ability is thought to consist of grammar knowledge, vocabulary knowledge, the ability to comprehend long passages with full context (that is, text with no deleted words or blanks intended for other questions), and the ability to understand passages without sufficient context or information. In other words, a test of reading ability should be composed of a grammar test, a vocabulary test, a reading comprehension test, and a cloze test.
| ". . . a placement test must be specifically related to a given program." | 
Although commercialized tests such as the TOEFL-ITP, TOEIC-IP, G-TELP, Step-EIKEN, and CASEC exist, the faculty members agreed that the content, level, and purpose of those tests were not appropriate for placing students in the literature department. Furthermore, the admissions test could not be used for any purpose other than entrance selection; admission tests are basically used for screening only. Since a variety of admission routes are now in use (admissions office tests, interview tests, the Center Test, high-school recommendation), the so-called admission tests do not seem to function well for placement purposes. In addition, people are so concerned about the privacy and security of admission test scores that it seems extremely difficult to use admission test results for streaming instruction.
Tests can be valid or not depending on whether they agree with the purpose of the test users. The purpose of the aforementioned tests does not seem to fit the purpose of the faculty of letters of Keio University. For example, we are not solely intent on measuring students who will study overseas, or on assessing the skills of students who will begin business communication after graduation. The purpose of this project is to encourage students to develop their English reading ability, which is indispensable for their major area studies. Almost all students in the faculty of letters are required to read materials in English whether their major is English or not. For these reasons, we decided to develop our own placement test.
Commenting on this point, Westrick suggests:
More studies on the use of commercially-produced tests and in-house tests for placement purposes at other Japanese colleges and universities are needed. Creating an effective placement test involves developing test items related to a true curriculum with clear goals and objectives, piloting the test items, analyzing the data, and revising the tests to ensure that the scores are reliable and sound placement decisions can be made. This requires hard work, but it must be done if fair and defensible placement decisions are to be made. (2005, p. 90)
Furthermore, a number of other scholars take a similar stance on placement testing. Brown (1996, p. 12) says that a placement test must be specifically related to a given program. Hughes (2003, p. 16) claims that placement tests should be developed by the users themselves so that they specifically meet their needs. And Fulcher argues:
The goal of placement testing is to reduce to an absolute minimum the number of students who may face problems or even fail their academic degrees because of poor language ability or study skills. (1997, p. 113)
Purpose of this study

The purpose of the present study is to examine the pilot version of a placement test and decide whether the real version of the test should have the same format.
McNamara (2000, p. 83) states, "There are three basic critical dimensions of tests – validity, reliability, and feasibility, whose demands need to be balanced." McNamara (2000, pp. 50-51) also mentions three aspects that can threaten test validity: (1) test content, (2) test method, and (3) test construct.
Taking these facets of a test into consideration, this study seeks to examine whether the pilot version of this particular placement test has enough validity, reliability, and practicality to merit further implementation. This overall question gives rise to the following hypotheses:
From a Rasch perspective, validity denotes the degree to which observed research results fit a given model. Construct validity in the Rasch model is investigated through the examination of five steps:
(1) chi-square examination,
(2) fit residual examination,
(3) location examination,
(4) item characteristic curve, and
(5) targeting information.
Among these, item analysis using the item characteristic curve (ICC) is the main focus of the present research because it can contribute greatly to improving the revised test. The ICC shows how well each item fits the model; in other words, it provides evidence of construct validity. It also indicates whether an item discriminates well among students. Along with the ICC, information about the distracters will be discussed as well.
| "A [placement] test is said to have content validity if the questions reflect the course content or syllabus." | 
Also, the content validity of this test will be discussed in a non-statistical way. A test is said to have content validity if the questions reflect the course content or syllabus. A test is said to have face validity if the test stakeholders think that the test is measuring what it should. In the discussion of content validity, the test construct and the test method are also discussed. The test construct will be discussed in terms of the difficulty order of the subsections. The test method discussion will focus on how the test was planned, administered, and scored. Face validity will be investigated through examinee questionnaire results.
Reliability is investigated by the person separation index, which is equivalent to Cronbach's alpha. A widely accepted benchmark for the person separation index is 0.7 or more.
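As a non-authoritative illustration of this statistic, the following minimal Python sketch computes classical Cronbach's alpha from a persons-by-items matrix of dichotomous scores. RUMM derives the person separation index differently (from Rasch person estimates and their standard errors), so this is offered only as the classical analogue mentioned above; the demo matrix is hypothetical.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Classical Cronbach's alpha for a persons-by-items score matrix.

    This is the classical-test-theory analogue of the Rasch person
    separation index discussed in the text, not RUMM's own statistic.
    """
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical example: 8 examinees, 5 dichotomous items
demo = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
])
print(round(cronbach_alpha(demo), 3))
```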
This pilot placement test was developed to examine these hypotheses in relation to the main research question.
A 50-item multiple-choice placement test with four components was used in this study. The test contained 15 grammar MC questions, 10 vocabulary MC questions, three long reading passages with 5 MC questions each, and 10 cloze MC questions. Applicants had 60 minutes to complete the test, which was scored by optical readers. The reading section consisted of one beginning-level, one intermediate-level, and one advanced-level passage, each about 400-500 words in length. Difficulty was rated impressionistically by teachers in terms of content, topic, and vocabulary level.
Nakamura (1998, p. 260) proposed four points to consider in assessing reading ability: (1) the nature of reading, (2) the theoretical or linguistic underpinnings of reading, (3) the test format of reading, and (4) classroom teachers' ideas based on their teaching experience. The construct of "reading ability" for this test was established mainly from these points plus the specific characteristics of the faculty of letters, as follows:
The grammar items were chosen to cover nearly all of the grammar points that students are expected to have mastered at the high school level, as presented in the textbooks authorized by the Ministry of Education and available at bookstores. Since we did not pretest items in order to determine their difficulty empirically, we relied on theory to create items and sections at different ability levels. For example, the vocabulary items were based on word frequency counts, benchmarked against the English-Japanese dictionaries available at bookstores, and the grammar items were based on developmental sequences and on an analysis of the written structures in the authorized textbooks.
The reading passages were selected from three disciplines (humanities, social sciences, and natural sciences), and appropriate vocabulary levels were taken into consideration. The passages were analyzed using the L1 Flesch Reading Ease readability formula together with the judgments of experienced teachers.
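For reference, the Flesch Reading Ease formula itself is standard: 206.835 minus 1.015 times the average sentence length minus 84.6 times the average syllables per word. The sketch below applies it in Python; the syllable counter is a crude heuristic standing in for whatever tool the project actually used, so treat the output as illustrative only.

```python
import re

def count_syllables(word: str) -> int:
    # Crude vowel-group heuristic; real readability tools use dictionaries.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

# Higher scores indicate easier text (90+ is very easy, below 30 very hard).
print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 1))
```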
The test data were analyzed using the RUMM 2020 statistical program. The chi-square values were investigated to determine whether there was a large gap between neighboring scores. The benchmark for the acceptable range of the fit residual scores was between -3 and +3. The location order was examined to obtain the construct of the item difficulty order. The item characteristic curves were examined to check the discriminating power of each item. The benchmark for the person separation index of test reliability was set at 0.7 or over.
In the explanation below, four abbreviations will be used: G stands for grammar, V for vocabulary, R for reading, and C for cloze.

The chi-square statistic is used to check whether the responses fit the model. In the chi-square column, smaller is better, while in the probability column (which shows the magnitude), bigger is better. We also examine whether there is a large gap between neighboring items in the chi-square order.
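RUMM's exact chi-square computation is internal to the program, but the general idea can be sketched: examinees are grouped into ability class intervals, and the observed scores in each interval are compared with what the Rasch model predicts. The Python sketch below is a simplified, hypothetical version of that comparison, not RUMM's algorithm; the data and the item difficulty are invented.

```python
import numpy as np

def rasch_prob(theta, b):
    """Rasch model probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_chi_square(responses, thetas, b, n_intervals=5):
    """Pearson-style chi-square over ability class intervals (simplified)."""
    order = np.argsort(thetas)
    chi_sq = 0.0
    for g in np.array_split(order, n_intervals):
        p = rasch_prob(thetas[g], b)
        observed = responses[g].sum()
        expected = p.sum()                  # model-expected correct count
        variance = (p * (1 - p)).sum()      # binomial variance of that count
        chi_sq += (observed - expected) ** 2 / variance
    return chi_sq

# Hypothetical data: 100 examinees, one item of difficulty b = 0.5 logits
rng = np.random.default_rng(0)
thetas = rng.normal(0, 1, 100)
responses = (rng.random(100) < rasch_prob(thetas, 0.5)).astype(int)
print(round(item_chi_square(responses, thetas, 0.5), 2))
```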

This table shows that three items (C47, R29, and G11) need to be examined because there was a gap of 8 chi-square points from the neighboring items in the chi-square column.
The fit residual is used to check whether an item has discriminating power. The acceptable range is usually from -3 to +3. A negative residual means the item is overdiscriminating (overfitting), while a positive residual means it is underdiscriminating (underfitting).
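As a rough illustration of where such residuals come from, assuming a dichotomous Rasch model: each response has expectation P and variance P(1 - P), and the standardized residuals are aggregated across persons. The standardization below is a simplified stand-in, not RUMM's exact formula, and the demo data are hypothetical.

```python
import numpy as np

def standardized_residuals(responses, thetas, b):
    """z_ni = (x_ni - P_ni) / sqrt(P_ni * (1 - P_ni)) for one item."""
    p = 1.0 / (1.0 + np.exp(-(thetas - b)))
    return (responses - p) / np.sqrt(p * (1 - p))

def item_fit_residual(responses, thetas, b):
    """Simplified aggregate fit statistic, scaled so 0 means perfect fit.

    Sum of squared residuals has mean n and variance about 2n, so this
    behaves roughly like a z-score (not RUMM's exact computation).
    """
    z = standardized_residuals(responses, thetas, b)
    n = len(z)
    return ((z ** 2).sum() - n) / np.sqrt(2 * n)

# Hypothetical demo: 200 examinees, one item of difficulty 0.2 logits
rng = np.random.default_rng(3)
thetas = rng.normal(0, 1, 200)
responses = (rng.random(200) < 1 / (1 + np.exp(-(thetas - 0.2)))).astype(int)
print(round(item_fit_residual(responses, thetas, 0.2), 2))
```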

According to the benchmark of the acceptable range (-3 to +3), among the three items flagged in the chi-square investigation, R29 is regarded as overfitting (overdiscriminating) and G11 as underfitting (underdiscriminating). Based on this chi-square and fit residual information, three items (R29, G11, and C47) appear problematic. They need to be investigated further in terms of location order.
Item statistics in location (difficulty) order:

--------------------------------------------------------------------------
Seq  Item  Type  Location     SE  Residual      DF   ChiSq  DF      Prob
--------------------------------------------------------------------------
 28  R28   MC     -3.234   0.253    -0.868  781.00  10.653   9  0.300268
 10  G10   MC     -2.098   0.153    -1.534  781.00  17.766   9  0.037983
 14  G14   MC     -1.953   0.145    -1.503  781.00  15.122   9  0.087630
 29  R29   MC     -1.702   0.131    -3.077  781.97  45.955   9  0.000001
 27  R27   MC     -1.606   0.127    -1.153  781.97   8.169   9  0.517196
 13  G13   MC     -1.311   0.114    -0.772  780.02  37.366   9  0.000023
 34  R34   MC     -1.306   0.114    -1.958  781.00  19.300   9  0.022759
 39  R39   MC     -1.207   0.110    -1.976  778.06  23.730   9  0.004751
  2  G02   MC     -1.191   0.110    -0.572  781.97   4.759   9  0.854781
 26  R26   MC     -1.169   0.109    -0.631  781.97  13.323   9  0.148513
 46  C46   MC     -1.016   0.104    -2.359  775.12  23.194   9  0.005777
  7  G07   MC     -0.997   0.103    -2.294  781.00  27.909   9  0.000988
 40  R40   MC     -0.820   0.098    -0.874  778.06   8.037   9  0.530397
  1  G01   MC     -0.680   0.094    -1.467  781.97  15.424   9  0.079919
 45  C45   MC     -0.526   0.091    -0.768  777.08  14.598   9  0.102588
  8  G08   MC     -0.448   0.089     0.661  781.00   6.283   9  0.711335
 24  V24   MC     -0.393   0.088     2.409  781.97  35.324   9  0.000052
  3  G03   MC     -0.227   0.085    -3.647  780.02  32.560   9  0.000159
 37  R37   MC     -0.103   0.083    -1.914  779.04  18.754   9  0.027367
  6  G06   MC     -0.064   0.082    -2.258  781.00  18.563   9  0.029180
 36  R36   MC     -0.034   0.082     0.430  774.14  12.346   9  0.194485
 15  G15   MC     -0.024   0.081     0.012  780.02  14.703   9  0.099426
 12  G12   MC      0.057   0.080     1.072  780.02  13.304   9  0.149347
 50  C50   MC      0.103   0.081     1.497  764.36  29.045   9  0.000637
 48  C48   MC      0.172   0.080     0.518  771.21   9.040   9  0.433593
 38  R38   MC      0.209   0.079     0.215  779.04   7.787   9  0.555765
 31  R31   MC      0.218   0.079     1.230  780.02   8.772   9  0.458594
 23  V23   MC      0.284   0.078     0.658  781.00   6.175   9  0.722263
  5  G05   MC      0.341   0.077     0.211  779.04  11.815   9  0.223935
 33  R33   MC      0.347   0.077    -0.186  781.97   9.051   9  0.432568
 18  V18   MC      0.405   0.077     1.989  778.06   9.579   9  0.385588
 41  C41   MC      0.484   0.076     0.962  780.02  13.116   9  0.157419
 21  V21   MC      0.527   0.076    -1.753  781.00  17.097   9  0.047220
 32  R32   MC      0.533   0.076    -0.613  781.97  10.796   9  0.289949
  9  G09   MC      0.652   0.075    -0.882  780.02  15.579   9  0.076204
  4  G04   MC      0.672   0.075     3.148  781.00  18.594   9  0.028875
 19  V19   MC      0.723   0.075     4.998  780.02  14.843   9  0.095339
 25  V25   MC      0.728   0.075     1.496  781.97   4.276   9  0.892325
 22  V22   MC      0.748   0.075     0.978  776.10  12.365   9  0.193523
 43  C43   MC      0.802   0.075     0.976  780.02   7.183   9  0.618101
 42  C42   MC      0.902   0.075     1.848  778.06  13.526   9  0.140218
 16  V16   MC      0.975   0.075     4.733  780.02  21.973   9  0.008967
 17  V17   MC      0.978   0.075     2.545  780.02  15.440   9  0.079538
 44  C44   MC      1.116   0.076     3.518  771.21  19.936   9  0.018314
 11  G11   MC      1.255   0.076     5.942  780.02  61.082   9  0.000000
 30  R30   MC      1.353   0.076     0.042  781.97   6.548   9  0.684044
 20  V20   MC      1.483   0.077    -0.294  781.00   8.577   9  0.477171
 35  R35   MC      1.499   0.077    -0.482  780.02  14.418   9  0.108223
 49  C49   MC      1.975   0.084     2.342  764.36  30.129   9  0.000417
 47  C47   MC      2.570   0.095     2.698  774.14  45.466   9  0.000001
--------------------------------------------------------------------------
R29 is close to the easiest end of the order (the fourth easiest item), while G11 is the sixth most difficult. C47, which was flagged as problematic in terms of its chi-square order, is the most difficult item of all. The location order also shows that the reading and grammar items tend to fall on the easier side of the continuum, while the vocabulary and cloze items are relatively difficult. In other words, the location order reveals the construct of the item difficulty order.
Item characteristic curve (ICC)

The ICC is used to show in detail the degree of agreement between the observed proportions and the theoretical curve.
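The theoretical curve here is the Rasch model ICC. The Python sketch below (using matplotlib) shows how observed class-interval proportions are compared against that curve. The item location of 1.255 logits is G11's value from the table above, but the plotted "observed" points are invented for illustration, patterned on the flat response profile described for G11 in the next paragraph.

```python
import numpy as np
import matplotlib.pyplot as plt

def rasch_icc(theta, b):
    """Rasch ICC: model probability of a correct answer at ability theta."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

theta = np.linspace(-4, 4, 200)
plt.plot(theta, rasch_icc(theta, 1.255), label="theoretical ICC (b = 1.255)")
# Hypothetical observed class-interval proportions for an item like G11:
# flatter than the model curve, i.e., underdiscriminating.
plt.scatter([-2, -1, 0, 1, 2], [0.25, 0.28, 0.30, 0.40, 0.55],
            label="observed proportions (illustrative)")
plt.xlabel("Person ability (logits)")
plt.ylabel("P(correct)")
plt.legend()
plt.show()
```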

The ICC of G11 shows that the less able students performed better than anticipated, while the more able students performed more poorly than anticipated. This indicates that the item did not discriminate well between lower-level and intermediate-level students. A likely reason is that the item was a little too difficult (1.255 logits above the mean); it would probably function better to differentiate the more able students at the top end. From the lower end to the mid group there is no discrimination, or even negative discrimination. From the mid group to the top it has some discriminating power, but still not as much as anticipated. In short, the item did not yield three clearly separated groups as hoped.
The ICC of R29 shows that the item is problematic because it is overdiscriminating. It separates the lower-level from the intermediate-level students, but it does not discriminate among the top-level students, who seem to have had some advantage or bias regarding the topic. The less able students do not fit the model: the lower end is overdiscriminating, and the lower group performed more poorly than anticipated.

The ICC of C47 indicates that the item has no discriminating power: all groups answered it correctly at below the guessing level. This is probably why it was flagged as problematic in terms of its chi-square order.
So far, only three items have been identified as problematic. As a percentage, they are just three out of 50, or 6% of the whole test. However, it would be premature to conclude from this figure alone that the problem is slight.

Now let us examine Figure 4, which shows the distracter information curve. The distracter information curve indicates how well the distracters (answer options) are functioning to attract the test takers.
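As a rough sketch of what underlies such a display (the data, the key, and the ability estimates below are all hypothetical), the proportion of examinees choosing each option can be tabulated by ability interval; plotting each option's column against ability gives curves like the ones discussed here, where the key should rise with ability and each distracter should fall.

```python
import numpy as np

def distracter_proportions(choices, abilities, options=(1, 2, 3, 4), n_intervals=5):
    """Proportion choosing each option, by ability class interval.

    Rows are ability intervals (low to high); columns are answer options.
    """
    order = np.argsort(abilities)
    rows = []
    for idx in np.array_split(order, n_intervals):
        rows.append([np.mean(choices[idx] == opt) for opt in options])
    return np.array(rows)

# Hypothetical data: 400 examinees, key = option 3
rng = np.random.default_rng(2)
abilities = rng.normal(0, 1, 400)
p_correct = 1 / (1 + np.exp(-(abilities - 0.5)))
correct = rng.random(400) < p_correct
choices = np.where(correct, 3, rng.choice([1, 2, 4], 400))
print(distracter_proportions(choices, abilities).round(2))
```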

Figure 4 looks strange because up to an ability level of 1.2 the key and the distracters functioned in a confusing way; only above 1.2 did the key function properly. Between 0.7 and 1.0, students preferred Option 2 to the key. In this case the lower- and upper-ability students got the item correct.
Let us take a look at how the distracters are behaving in the following item.

For R29, the key answer functioned well, and the distracters were each chosen less often than the key. The distracters for this item appear to work reasonably well.

Let us look at how the distracters are functioning in the following item.

Item C47 is strange and difficult. The three distracters and the key do not function at any ability level; even the key is chosen at below the guessing level. There appear to be two plausible answers, with Option 3 the most popular, suggesting that the students misunderstood the concept. The key answer does not discriminate.
Targeting information shows the relative position between persons (person ability) and items (item difficulty). The following graph shows that we need some more difficult items to match the more able students in a future version.

This figure suggests that as a whole the test was very good at measuring the students' English proficiency. For future improvement, more difficult items are needed to match the more able students at the top of the continuum.
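As a rough, hypothetical illustration of the comparison such a person-item map supports (RUMM plots the two distributions on a shared logit scale; the estimates below are invented, with the item range loosely patterned on the table above):

```python
import numpy as np

# Hypothetical logit estimates standing in for RUMM output.
rng = np.random.default_rng(1)
person_abilities = rng.normal(0.5, 1.0, 800)   # person estimates (logits)
item_locations = np.linspace(-3.2, 2.6, 50)    # item difficulties (logits)

print(f"mean person ability:  {person_abilities.mean():+.2f} logits")
print(f"mean item difficulty: {item_locations.mean():+.2f} logits")
# Persons located above the hardest item are measured imprecisely; a large
# count here signals the need for more difficult items in a future version.
print("persons above hardest item:",
      (person_abilities > item_locations.max()).sum())
```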
Reliability was investigated by the person separation index, which is akin to Cronbach's alpha, with 0.7 as the benchmark for acceptability. This placement test obtained a person separation index of 0.78, which suggests that the items in the test were internally consistent.
This study explored whether the pilot version of this placement test had enough validity and reliability to proceed to the real test. Three null hypotheses pertaining to the primary research question were rejected.
The ability level of the top group was higher than the difficulty level of the items in most cases. On the other hand, the ability of the bottom group, who need remedial instruction, was below the difficulty level of the reading and grammar sections in most cases.
| ". . . validity examinations should be conducted in detailed ways which include concurrent validity and/or factor analysis methods." | 
Hypothesis 1, "The test does not have enough validity," was not supported. The discovery of three problematic items was a minor defect from the viewpoint of the whole test; in other words, 94% of the test items fit the model, which technically supports the construct validity of the test. However, validity should also be examined in more detailed ways, including concurrent validity and/or factor analysis methods.
Hypothesis 2, "The test does not have acceptable reliability," was not supported. Reliability was investigated by the person separation index, which exceeded the acceptable benchmark of 0.7. Accordingly, the alternative hypothesis, "The test is reliable," was accepted.
The 2007 version of the test should explore the issues of face validity and practicality more systematically. Since the 2006 test was investigated mainly in terms of reliability and construct validity, further research is needed to fully corroborate this test.

Considering McNamara's (2000, p. 83) statement that "the right balance of three basic critical dimensions of tests – validity, reliability and practicality – will depend on the test context and test purpose," the present placement test should be regarded as acceptable judging from the statistical analyses as well as the test context and purpose. For future improvement, predictive validity should be investigated, along with concurrent validity, factor analysis, and multi-trait multi-method (MTMM) analysis, for test validation.
Acknowledgement

The present author is grateful to Dr. David Andrich and Dr. Irene Styles for their invaluable comments and professional help.