So what are we listening for?
by Kristie Sage (Komazawa University & Gakushuin University)
By comparing and contrasting the English listening sections of the Test of English as a Foreign Language Internet Based Test (TOEFL® iBT) and the 2006 Japanese National Centre Test (J-NCT), this research proposes the development of a test with higher construct validity for the latter. The J-NCT has a significant gate-keeping function, and this is the first study to systematically examine the test items used in the J-NCT English listening section using both empirical and judgmental analyses. This study examines issues such as construct breadth and response formats. It concludes that the TOEFL® iBT listening score is a better indication of students' English listening ability than the J-NCT score, as the TOEFL® iBT is more integrative and more representative of both academic and conversational discourse domains.
Keywords: Japanese National Centre Test (J-NCT), construct validity, item analysis, empirical analysis, judgmental analysis, constructs, test items, integrative testing, TOEFL® iBT (Test of English as a Foreign Language Internet Based Test)
[ p. 74 ]
"In objective-format tests which are favoured for norm-referenced tests, reliability is often upheld – but not necessarily validity. . ."
[ p. 75 ]
Next, the item discrimination for each item was calculated, as shown in the bottom table in Appendix B. These data were totalled, and the descriptive statistics for this test were interpreted and summarized in Table 1. Microsoft Excel was used to calculate these statistics and to graph the scores for the bell curve in Figure 2.

Number of items: | 50* | Mode: | 38 | Standard Deviation: | 7.45 |
High score: | 38 | Median: | 37 | Skewness: | -0.02 |
Low score: | 22 | Mean: | 38 | Kurtosis: | -0.95 |
* 25 items, weighted 2 points per item | Range: | 22-50 |
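As an illustration of how descriptive statistics of this kind could be reproduced outside Excel, here is a minimal Python sketch. It assumes the sample-based formulas documented for Excel's STDEV, SKEW and KURT functions (the article does not state which Excel functions were used), and the scores shown are hypothetical, not the study's data.

```python
import statistics


def describe(scores):
    """Return descriptive statistics for a list of test scores.

    Assumption: sample-based formulas (n - 1 denominator for the SD, and
    the bias-adjusted skewness/kurtosis that Excel's SKEW and KURT use).
    Requires at least four scores.
    """
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    z = [(x - mean) / sd for x in scores]  # standardized scores

    skewness = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurtosis = (
        n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    )
    return {
        "n": n,
        "high": max(scores),
        "low": min(scores),
        "mode": statistics.multimode(scores),  # list, since ties are possible
        "median": statistics.median(scores),
        "mean": mean,
        "sd": sd,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


if __name__ == "__main__":
    # Hypothetical scores on a 50-point scale (2 points per item).
    hypothetical_scores = [22, 28, 30, 34, 36, 38, 38, 40, 44, 46, 50]
    for name, value in describe(hypothetical_scores).items():
        print(f"{name}: {value}")
```

A skewness near zero indicates a roughly symmetrical distribution, while a negative kurtosis indicates a distribution flatter than the normal curve.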
[ p. 76 ]
The colour coding represents five combined IF and ID levels. The percentages grouped by the bracket in the far left column labelled "REJECT" show the test questions by section that had both unacceptable ID and IF levels (64%). The red row indicates that both the IF and ID were significantly low, that is, the closest to the statistical cut-off points. As red merges towards light orange, the IF and ID move further away from the statistical cut-offs, yet are still within the advised range for rejection. Items shaded beige can be kept, as the LG got them wrong while the HG got them right, so they differentiate students' proficiency well. It is a concern that these represent only 36% of the question items. The star, circle, and rectangle markings will be discussed in detail in the following section.

Section 1 sample question:
Question: What does the man order?
Recorded Dialogue:
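The IF and ID values behind the colour coding described above could be computed along the following lines. This is a minimal Python sketch, assuming that IF is the proportion of examinees answering an item correctly and that ID is the IF of the high-scoring group minus the IF of the low-scoring group. The use of upper and lower thirds, the function names, and the response matrix are assumptions for illustration and are not confirmed by the article.

```python
def item_facility(responses):
    """Proportion of correct answers; responses are 1 (correct) or 0."""
    return sum(responses) / len(responses)


def analyse_items(matrix, group_fraction=1 / 3):
    """Compute IF and ID for each item.

    matrix: one row per examinee, one 0/1 entry per item.
    group_fraction: share of examinees in each of the high and low groups
    (an assumption; the article's grouping rule is not given here).
    """
    # Rank examinees by total score, highest first.
    ranked = sorted(matrix, key=sum, reverse=True)
    k = max(1, round(len(ranked) * group_fraction))
    high_group, low_group = ranked[:k], ranked[-k:]

    results = []
    for i in range(len(matrix[0])):
        item_if = item_facility([row[i] for row in matrix])
        item_id = (
            item_facility([row[i] for row in high_group])
            - item_facility([row[i] for row in low_group])
        )
        results.append({"item": i + 1, "IF": item_if, "ID": item_id})
    return results


if __name__ == "__main__":
    # Hypothetical responses: 6 examinees x 3 items.
    hypothetical = [
        [1, 1, 1],
        [1, 1, 0],
        [1, 0, 1],
        [1, 0, 0],
        [0, 0, 1],
        [0, 0, 0],
    ]
    for row in analyse_items(hypothetical):
        print(row)
```

An item with an ID near zero, like the third hypothetical item, is answered equally well by both groups and so does little to separate stronger from weaker listeners. Any cut-off values applied to these figures would need to come from the article's own criteria.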
[ p. 77 ]
• Section 2
Section 2 sample question:
Recorded Dialogue:
1) No, bowling doesn't seem interesting.
2) No, I didn't know about it.
3) Yeah, I have to study on Friday night.
4) Yeah, I'm free then.
[ p. 78 ]
• Section 3
Section 3 sample question:
Written Schedule:
1) Cleanup
2) Cooking lessons
3) Cultural presentations
4) International folk dancing
5) Mr. Cranston's opening speech
6) Music demonstrations
[ p. 79 ]
• Section 4
Section 4 sample question:
Spoken Monolog: "... Suddenly, the weather forecasters were shocked to find out that there was not one, but two powerful storms approaching the island. In fact, the first one was being followed by an even more powerful one."
Spoken Question: Why were the weather forecasters shocked?
1) A second hurricane was approaching the island.
2) The destruction was expanding rapidly.
3) The hurricane lasted much longer than usual.
4) They had glorious weather in spite of the hurricane.
[ p. 80 ]
The TOEFL® iBT in Japan
". . . the J-NCT test scores do not clearly reflect how well examinees can perform in academic settings."
[ p. 81 ]
[ p. 82 ]
Further, our empirical analysis showed that the J-NCT test items assess only a limited construct of conversational English listening ability. This paper attributes this failing to the exclusive use of a multiple-choice response format throughout the exam. This is supported by the fact that, for the survey group in this study, 64% of the J-NCT test items produced both a low IF and a low ID. Future research should examine how different survey samples respond to this test. It has been argued that if an integrated test construct such as that of the TOEFL® iBT were employed, English academic listening skills, such as those required in the university environment, would be addressed. Thus, a larger survey sample taking both the J-NCT and TOEFL® iBT listening tests would be beneficial for seeing how their scores correlate. Can the TOEFL® iBT discriminate adequately among students within the same university? Would the score band width be sufficiently broad to make either of these tests a feasible placement tool? Those are some of the questions to be answered in future studies.

"A procedure that incorporates consistent piloting, analysing and revising of items can only enhance a test's validity."
[ p. 83 ]
• Proposal 3: Pilot the J-NCT and delete items which perform poorly
[ p. 84 ]
[ p. 85 ]