So what are we listening for? A comparison of the English listening constructs in the Japanese National Centre Test and TOEFL iBT

Authentic Communication: Proceedings of the 5th Annual JALT Pan-SIG Conference.
May 13-14, 2006. Shizuoka, Japan: Tokai University College of Marine Science. (p. 74 - 98).

So what are we listening for?
A comparison of the English listening constructs in
the Japanese National Centre Test and TOEFL^® iBT

by Kristie Sage (Komazawa University & Gakushuin University)
Nozomi Tanaka (Ochanomizu University)

Abstract

By comparing and contrasting the English listening sections of the Test of English as a Foreign Language Internet Based Test (TOEFL^® iBT) and the 2006 Japanese National Centre Test (J-NCT), this research proposes the development of a test with higher construct validity for the latter. The J-NCT has a significant gate-keeping function, and this is the first study to systematically examine the test items used in the J-NCT English listening section utilizing both empirical and judgemental analyses. This study examines issues such as construct breadth and response formats. It concludes that the listening score of the TOEFL^® iBT is a better indication of students' English listening ability than that of the J-NCT, as it is more integrative and representative of both academic and conversational discourse domains.

Keywords: Japanese National Centre Test (J-NCT), construct validity, item analysis, empirical analysis, judgemental analysis, constructs, test items, integrative testing, TOEFL^® iBT (Test of English as a Foreign Language Internet Based Test)

What is the most important quality of a test? Bachman and Palmer (1996, pp. 17-18) argue that it is the test's usefulness for its intended purpose, and suggest a test usefulness framework for quality control throughout the test development cycle. High-stakes tests which are administered to large numbers, needless to say, necessitate high-quality results (Chalhoub-Deville & Turner, 2000, pp. 526, 537). To achieve this, the highest possible levels of construct validity and reliability should be aimed for (Bachman & Palmer, 1996, pp. 23, 135-136). These critical measurement qualities are used to justify the inferences and decisions made on the basis of test scores (McNamara, 2000, pp. 7-11; Chalhoub-Deville & Turner, 2000, pp. 524-526). Six factors thought to influence test usefulness are depicted in Figure 1.

Figure 1. A model of test usefulness from Bachman & Palmer (1996, p.18)

In objective-format tests, which are favoured for norm-referenced testing, reliability is often upheld, but not necessarily validity (McNamara, 2000). Construct validity relates to domain generalisation, that is, the degree to which the scores produced from a test method are reflective of the ability we wish to measure. Construct validity is not just test specific, but pertains to Target Language Use (TLU) situations (Bachman & Palmer, 1996, p. 21; McNamara, 2000, p. 52). At times TLU can be undermined by both construct under-representation and, to a lesser extent, construct-irrelevant variance (Fulcher, 2000; McNamara, 2000, p. 53). The former occurs when too little is demanded from a cohort being examined; the latter, when tests introduce aspects irrelevant to the ability being measured (McNamara, 2000, p. 53).

Judgemental and empirical approaches to data analysis

Validity is a fundamental notion to consider throughout the test development and evaluation process (Chalhoub-Deville & Turner, 2000, p. 525). Two ways to improve it are via empirical and judgemental analyses. The former ascertains whether or not test scores quantitatively differentiate between examinees' abilities (Popham, 1990, pp. 226, 268; McNamara, 1996). Judgemental analysis, by contrast, relies on qualitative assessments of the merit of individual test items (Popham, 1990, p. 266). It also relies on non-partisan, external reviewers or actual test-takers to provide valuable insights about test items (Popham, 1990, p. 269). McNamara (2000, p. 32) supports gathering test-taker feedback as a good way to spot weaknesses that may not be identified by test developers, contending that examinees do have the capacity to comment on difficulty level, identify unclear rubrics, and so on. For example, in a survey conducted by a prominent exam preparation school, Sundai Cram School, 672 students were given a questionnaire pertaining to the English listening section of the 2006 J-NCT. One question was, "Did you find any questions difficult?" 78.2% of the respondents answered "No" (Yomiuri Shinbun, 2006).

Descriptive statistics

In this study, the raw responses from a sample of 51 female students at a private senior girls' high school in the Nagoya area who took the 2006 J-NCT English listening test were collated, analysed, and evaluated. This primary data was recorded in an Excel^® spreadsheet matrix. Using a standard item analysis procedure, it was converted into binomial data (McNamara, 1996, pp. 261-268; Brown, 1995b, pp. 43-45). That is, the examinees' correct answers were assigned a 1 and incorrect answers a 0. After random ID numbers were assigned to respondents, the individual scores were ranked, as in the top table of Appendix B.
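The conversion step described above can be sketched in Python. This is only a minimal illustration, not the procedure actually carried out in Excel: the answer key, the examinee IDs, the responses, and the function names are all hypothetical, and only the weighting of 2 points per item is taken from the test itself.

```python
# Minimal sketch: turn raw multiple-choice responses into binomial (1/0) data,
# compute weighted totals, and rank examinees from highest to lowest.
# The answer key and responses are invented; they are NOT the 2006 J-NCT data.

POINTS_PER_ITEM = 2  # each listening item is weighted at 2 points


def to_binomial(responses, answer_key):
    """Return 1 for each correct answer and 0 for each incorrect one."""
    return [1 if given == correct else 0
            for given, correct in zip(responses, answer_key)]


def rank_examinees(raw, answer_key):
    """Score every examinee and sort by weighted total, highest first.

    raw maps an anonymised examinee ID to that examinee's chosen options.
    Returns a list of (examinee_id, binomial_row, weighted_total) tuples.
    """
    scored = []
    for examinee_id, responses in raw.items():
        row = to_binomial(responses, answer_key)
        scored.append((examinee_id, row, sum(row) * POINTS_PER_ITEM))
    return sorted(scored, key=lambda entry: entry[2], reverse=True)


if __name__ == "__main__":
    answer_key = [3, 1, 4, 2]  # hypothetical key for four items
    raw = {"S07": [3, 1, 2, 2], "S12": [3, 1, 4, 2], "S31": [1, 1, 2, 4]}
    for examinee_id, row, total in rank_examinees(raw, answer_key):
        print(examinee_id, row, total)
```

The ranked rows produced this way correspond to the kind of matrix shown in the top table of Appendix B, with one row per examinee and one 1/0 column per item.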

Next, the item discrimination for each item was calculated, as in the bottom table in Appendix B. This data was totalled and the descriptive statistics for this test were interpreted and summarized in Table 1. Microsoft Excel was used to calculate these and to graph the scores for the bell curve in Figure 2.

Table 1. Descriptive statistics of the sample for the 2006 J-NCT English listening test

Number of items:	50*	Mode:	38	Standard deviation:	7.45
High score:	50	Median:	37	Skewness:	-0.02
Low score:	22	Mean:	38	Kurtosis:	-0.95
Range:	22-50
* 25 items, weighted 2 points per item
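
The statistics in Table 1 were produced in Excel. As a rough equivalent, the sketch below computes the same quantities in Python. It rests on assumptions that the paper does not confirm: that the standard deviation is the sample standard deviation, and that skewness and kurtosis follow the sample-adjusted formulas used by Excel's SKEW and KURT functions. The function names are hypothetical and the scores in the usage example are invented.

```python
import statistics


def skewness(scores):
    """Sample skewness (adjusted Fisher-Pearson), assumed to match Excel's SKEW."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in scores)


def excess_kurtosis(scores):
    """Sample excess kurtosis, assumed to match Excel's KURT (needs n >= 4)."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    fourth = sum(((x - mean) / sd) ** 4 for x in scores)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


def describe(scores):
    """Collect the descriptive statistics reported for a set of total scores."""
    return {
        "n": len(scores),
        "high": max(scores),
        "low": min(scores),
        "mode": statistics.mode(scores),
        "median": statistics.median(scores),
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),
        "skewness": skewness(scores),
        "kurtosis": excess_kurtosis(scores),
    }


if __name__ == "__main__":
    invented_scores = [30, 34, 38, 38, 38, 42, 46]  # not the study's data
    for name, value in describe(invented_scores).items():
        print(name, value)
```

A skewness close to zero, as in Table 1, indicates a roughly symmetrical distribution, while a negative kurtosis indicates a curve flatter than the normal distribution.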

Figure 2. 2006 J-NCT English listening section score distribution

Item discrimination

Next, item discrimination and difficulty indices were employed. Item discrimination (ID) ascertains whether a test taker's performance shows uniformity across the examined items, and item difficulty or facility (IF) investigates the appropriateness of individual test items for the target group's level (McNamara, 2000, p. 60). From the four sections in the J-NCT English listening test, the test tasks which showed both unacceptable IF and unacceptable ID were further investigated. Items should be rejected if the IF (p) is below .33 or above .67 (Henning, 1987, p. 49). To calculate ID, first a High Group (HG) and a Low Group (LG) must be established. Brown (1995b, pp. 43-44) suggests that each group should comprise between 25% and 35% of the total group. If the group is not large, Henning (1987, pp. 51-52) suggests 20% to 25%. For this study, 27% (n=14) was used. If the ID of an item was below .67, it was rejected, as shown in Appendix A, since this is the lowest acceptable cut-off point (Henning, 1987, p. 52). All calculations were performed in Microsoft Excel and are summarized in Figure 3.
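
The procedure above can be sketched as follows. The paper does not state the ID formula explicitly, so this sketch assumes that ID is the share of correct answers contributed by the High Group, Hc / (Hc + Lc), where Hc and Lc are the numbers of correct answers in the HG and LG. That assumption is consistent with the values reported in this study (an item that everyone answers correctly yields ID = .5), but it remains an assumption. The function names and the response matrix in the usage example are hypothetical.

```python
def item_facility(item_scores):
    """IF: proportion of all examinees who answered the item correctly."""
    return sum(item_scores) / len(item_scores)


def item_discrimination(high_scores, low_scores):
    """ID, assumed here to be Hc / (Hc + Lc); None if nobody answered correctly."""
    hc, lc = sum(high_scores), sum(low_scores)
    if hc + lc == 0:
        return None
    return hc / (hc + lc)


def analyse_items(matrix, group_fraction=0.27, if_min=0.33, if_max=0.67, id_min=0.67):
    """Compute IF and ID for every item in a 1/0 response matrix.

    matrix holds one row per examinee and one column per item.
    The cut-offs default to the values cited from Henning (1987).
    """
    ranked = sorted(matrix, key=sum, reverse=True)   # highest total first
    group_size = round(len(ranked) * group_fraction)  # 51 examinees -> 14
    high, low = ranked[:group_size], ranked[-group_size:]
    report = []
    for i in range(len(matrix[0])):
        facility = item_facility([row[i] for row in matrix])
        discrimination = item_discrimination([row[i] for row in high],
                                             [row[i] for row in low])
        report.append({
            "item": i + 1,
            "IF": round(facility, 2),
            "ID": None if discrimination is None else round(discrimination, 2),
            "IF_ok": if_min <= facility <= if_max,
            "ID_ok": discrimination is not None and discrimination >= id_min,
        })
    return report


if __name__ == "__main__":
    # Invented data: item 1 is answered correctly by all 51 examinees,
    # item 2 only by the 20 strongest examinees.
    matrix = [[1, 1]] * 20 + [[1, 0]] * 31
    for entry in analyse_items(matrix):
        print(entry)
```

In this invented example, the first item fails both checks because it is too easy to separate the groups, while the second passes both. An item would fall into the "reject" category used in this study when both IF_ok and ID_ok are false.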

Figure 3. ID and IF levels for the 2006 J-NCT English listening test questions grouped by sample section

The colour coding represents five combined IF and ID levels. The percentages grouped by the bracket in the far left column labelled "REJECT" show the test questions by section that had both unacceptable ID and IF levels (64%). The red coloured row indicates that both the IF and ID were significantly low, that is, the closest to the statistical cut-off points. As red merges towards light orange, the IF and ID become further away from the statistical cut-offs, yet are still within the advised range for rejection. At the colour beige, they can be kept as the test tasks show the LG got them wrong while the HG got them right, therefore differentiating students' proficiency well. It is a concern that this represents only 36% of the question items. The star, circle, and rectangle markings will be discussed in detail in the following section.

Examining the J-NCT section-by-section

• Section 1

This section consists of three (out of five) picture tasks in which respondents are asked to select the picture that most closely matches a conversation. The example below has been categorised as a picture task according to Valette (1977) and Heaton (1990), as cited in Buck (2001, p. 135). According to the TOEFL^® iBT's question taxonomy, it would not even fall into any of the Basic Comprehension Question categories, which are Gist-content, Gist-purpose, and Detail (see Appendix G).

Section 1 sample question:

Question: What does the man order?

Recorded Dialogue:

Woman:	What would you like to order?
Man:	I'd like eggs with toast and, uh, sausage . . . no sorry, I'd rather have bacon.
Woman:	OK. And coffee?
Man:	Sure.

The exercise from Section 1, Question 2 of the 2006 J-NCT English listening test (NCUEE, 2006d)

All of the respondents (100%) answered the question in Figure 4 correctly. It therefore had no discriminating value (IF=1; ID=.5), since the item failed to distinguish between the HG and the LG. Furthermore, three other questions from this section did not discriminate well between the proficient and less proficient examinees. It seems significant that three of the problematic questions had illustrations as the medium of response (Henning, 1987, p. 48).

• Section 2

In this section, each dialogue consists of two or three one-sentence conversation turns between a man (M) and a woman (W). Test takers are required to select the next line in the dialogue by choosing one of four multiple-choice options, as in the example below.

Section 2 sample question:

Recorded Dialogue:

Man:	Kathy, do you want to go bowling with us?
Woman:	Sure, but it depends. When are you going?
Man:	This Friday night. Can you come?
Woman:

Written Response:
1) No, bowling doesn't seem interesting.
2) No, I didn't know about it.
3) Yeah, I have to study on Friday night.
4) Yeah, I'm free then.

The exercise from Section 2, Question 12 of the 2006 J-NCT English listening test (NCUEE, 2006d)

This type of activity is referred to as a conversation task (Valette, 1977 and Heaton, 1990, as cited in Buck, 2001, p. 135).
Of the seven questions in this section, some have either an unacceptable IF or an unacceptable ID, but Q10 and Q12 rate poorly in both IF and ID, and hence made no distinction between the HG and LG cohorts, as depicted by the circled items in Figure 3. For Q12, 49 out of 51 examinees answered correctly (IF=.96; ID=.51, as stated in Appendix A). This should prompt closer examination of the alternate response cues (distracters), which may not have been made attractive enough (Henning, 1987, p. 44). Three judgemental assessments are worth noting. First, the intonation of the female speaker in the audio recording is enthusiastic and does not fit with responses 1 or 2. Secondly, option 3 may be easily eliminated, as studying is generally not a Friday night activity, and hence could be a common knowledge response (Henning, 1987, p. 47). Thirdly, responses 1 and 2 are ruled out since the existing dialogue already shows some keenness for the bowling activity, and hence these could represent nonsense distracters (Henning, 1987, p. 45).

• Section 3

Test takers were given an event schedule in a table format. Three activities were missing and examinees were required to choose the correct three from six options in order to fill in all of the missing time spots. Here are three sample questions from this section.

Section 3 sample question:

Written Schedule:

Time Slot	Activity
9:00 - 9:30	Opening Ceremony
9:30 - 10:00	[17]
10:00 - 12:00	[18]
12:00 - 14:00	Lunch
14:00 - 16:00	[19]
16:00 - 16:30	Closing Ceremony

Written Response Options:
1) Cleanup
2) Cooking lessons
3) Cultural presentations
4) International folk dancing
5) Mr. Cranston's opening speech
6) Music demonstrations

The exercise from Section 3, Questions 17-19 of the 2006 J-NCT English listening test (NCUEE, 2006d)

Out of the four sections, Section 3 was answered the most poorly. From the discrimination indices used, five out of its six questions (83%) should be rejected. In particular, all three questions in Part B (100%) displayed results signalling the need for review (Q17: IF=.92, ID=.54; Q18: IF=.88, ID=.56; Q19: IF=.78, ID=.61; see Appendix A). It is noteworthy that for Q17-Q19 of Part B the number of multiple-choice options was changed from three to six. Henning (1987, p. 45) notes that the number of options can pose problems. It is therefore possible that too many have been employed. This may cause some test takers to quickly forget the aural stimulus and reduce the validity of this as a listening comprehension question (Henning, 1987, p. 45). Additionally, Henning (1987, p. 45) states that when options are irregular in number a test question can become weaker. This makes it difficult to distinguish whether the problem lies in the content or the options.

• Section 4

This section consists of extended monologues in which examinees choose a response that best answers the questions that follow. The following excerpt is representative:

Section 4 sample question:

Spoken Monologue:

"... Suddenly, the weather forecasters were shocked to find out that there was not one, but two powerful storms approaching the island. In fact, the first one was being followed by an even more powerful one."

Spoken Question:

Why were the weather forecasters shocked?

Written Choices:
1) A second hurricane was approaching the island.
2) The destruction was expanding rapidly.
3) The hurricane lasted much longer than usual.
4) They had glorious weather in spite of the hurricane.

Part of the exercise from Question 24 of Section 4, Part B of the 2006 J-NCT English listening test, with italics added by the author to the matching material (NCUEE, 2006d)

In general, Q23 and Q25 bore good results (Q23: IF=.49, ID=.78; Q25: IF=.41, ID=1). That is, overall the HG answered correctly while the LG did not. However, Q24 showed an unacceptable IF of .71 and low discrimination between the HG and LG (ID=.6). In other words, 36 test takers out of 51 got it right (see Appendix B). Distracter options 2, 3, and 4 were too improbable. Additionally, the correct response for the passage above has too many words which appear directly in the text. Termed matching material by Henning (1987, p. 47), this raises the question of whether actual comprehension of the passage has taken place or whether the test taker has simply matched words.
Another problem with this section is that Questions 23 to 25 pertained to an extended monologue which was approximately three times the length of the other three monologues.

The TOEFL^® iBT in Japan

The TOEFL^® is used worldwide as an indicator of students' English language ability. Currently over 5,000 institutions adopt it and its number of test takers exceeds 720,000 annually. In Japan an estimated 90,000 students take this test each year (CIEE Nihon Daihyobu, 2005). This international recognition is good reason for Japan's Ministry of Education, Culture, Sports, Science and Technology (MEXT) to encourage and support the use of this proficiency test in the Japanese education system. It is part of MEXT's Action Plan, which aims to cultivate "Japanese with English abilities" (MEXT, 2003). In addition, a great number of universities in Japan utilize the TOEFL^® Institutional Testing Program (ITP), with 135,000 students per year participating in it (CIEE Nihon Daihyobu, 2004).
The TOEFL^®'s degree of influence is further demonstrated by its popular adoption as a placement test, an achievement test, and even as a progress test (Brown, 1995a, pp. 14-15; 1995b, p. 40). The CIEE Nihon Daihyobu (2004) estimates that 312 universities currently use TOEFL^® for a variety of purposes. These include using it as a substitute for the English section of the entrance examination, adopting it as a method for obtaining credit, using it to screen study abroad applicants, and so on. In Japan the newest version of TOEFL^® is the Internet-based TOEFL^® iBT. Launched in 2005, this emphasises more interactive, authentic, and communicative testing approaches and is anticipated to affect English education in Japan (ETS, 2005, p. 2).

Construct differences between the TOEFL^® iBT and Japanese NCT English listening sections

In light of the TOEFL^®'s ubiquity in Japan, this study attempts to draw attention to two ways in which its English listening section differs from that of the J-NCT in terms of construct validity (for more details see the Appendices). These are summarised below. In each case, comments are made first about the J-NCT and then about the TOEFL^® iBT's English listening section.

• Difference #1: Breadth of coverage

The monologues and dialogues in the J-NCT English listening section predominantly cover everyday life contexts such as ordering food at restaurants or interacting out-of-class with peers. As Coombe et al. (1998, p. 27) point out, the listening skills used for general purpose conversations differ from academic listening. Appendices D - H illustrate some of the ways these discourse domains differ.

The J-NCT only focuses on English listening skills pertaining to general purpose conversational discourses. For this reason test takers are merely required to comprehend lower level and literal utterances (Buck, 2001, p. 204). Thus, the J-NCT tests a narrower view of the listening construct than what is demanded at a Japanese university. It can be said that the J-NCT test scores do not clearly reflect how well examinees can perform in academic settings (Bachman & Palmer, 1996, p. 150). By contrast, the TOEFL^® iBT English listening section utilises independent (receptive) and integrated (combined) test item tasks, as indicated in Appendix K. This allows for coverage of both listening and academic skills and targets, on the whole, a broader listening construct. Hence it provides scores that are better measures of the listening ability of aspiring students for the academic arena.

• Difference #2: Response format

While the J-NCT uses only a multiple-choice format, the TOEFL^® iBT adopts multiple formats. According to Bachman and Palmer (1996, p. 150), dichotomous scores are not effective indicators of proficiency levels. Further, this discrete response format promotes testing of the formal linguistic system (McNamara, 2000, p. 14), where "localized grammatical characteristics" rather than broader, global discourse skills are the focus (Buck, 2001, p. 123). That is, multiple-choice is considered to hamper the generalisability of scores to the target domain, the university environment (Mercier, 1933, as cited in Fulcher, 2000; Bachman & Palmer, 1996). Conversely, the TOEFL^® iBT English listening section is more in keeping with the trend towards pragmatic and integrative testing, which is being driven by today's communicative and subsequently productive language use focus (McNamara, 2000, p. 14). In other words, it uses a variety of test tasks which: a) engage the students in different areas of listening language ability; and b) correspond better to the university environment than any one response format. Furthermore, if the test method consists of a mix of short answer, open ended, inference items, and so on, both lower and higher order listening skills are addressed (Buck, 2001).

Conclusion and recommendations

This paper has compared and contrasted the English listening sections of the J-NCT and the TOEFL^® iBT by focussing predominantly on construct validity. It has been argued that since the TOEFL^® iBT English listening section uses a variety of test tasks it is capable of addressing both conversational and academic listening skills. Thus, the construct validity of test scores obtained from the TOEFL^® iBT English listening section is considered higher than that of the J-NCT's. Hence the listening construct of the TOEFL^® iBT is recommended as a model for future J-NCT English listening section development. Specifically, two features are recommended: the breadth of coverage, in terms of its combination of academic listening genres and academically hued conversations, and the response format, which encompasses integrated tasks and is not limited solely to multiple-choice.
That is not to say the TOEFL^® iBT is a perfect model. One shortcoming of directly replicating its English listening section is the time it takes to complete: 60 - 90 minutes, as opposed to 30 minutes for the J-NCT (see Appendix K). After all, just how many response formats can the J-NCT cover in the limited time available? Also, the integrated scoring method would certainly be more difficult, time consuming and costly to operationalise than the current multiple-choice scoring. Furthermore, there are practical limitations involved. Especially in the short term, the J-NCT could probably not become a computer-adaptive examination, as the financial and logistical considerations involved in purchasing and providing access to computers are significant.
Nonetheless, by analysing both tests judgementally and empirically, this paper has provided some solid backing for the test method of the TOEFL^® iBT English listening section. In short, the judgemental analysis conducted in this paper introduced questionnaire results which revealed that many test-takers considered the J-NCT English listening section to be "easy". Since the TOEFL^® iBT was first administered in Japan on 15 July 2006, comparable data on examinees' perceptions of this test are not yet readily available. However, since it is a computer-adaptive listening exam, and the integrated section requires examinees to listen to a university lecture, read a passage pertaining to this lecture and then type a response into the computer, it is likely that the level will be just beyond the respondent's ability.

Further, our empirical analysis showed that the J-NCT test items assess only a limited construct of conversational English listening ability. This paper proposes that this failing is attributable to the exclusive use of a multiple-choice response format throughout this exam. This is supported by the fact that, for the survey group in this study, 64% of the test items in the J-NCT produced both an unacceptable IF and an unacceptable ID. Future research studies should see how different survey samples respond to this test. It has been argued that if an integrated test construct such as that in the TOEFL^® iBT were employed, English academic listening skills, such as those required in the university environment, would be addressed. Thus, a larger survey sample of the J-NCT and TOEFL^® iBT listening tests would be beneficial in terms of seeing how their scores correlate. Can the TOEFL^® iBT discriminate adequately among students within the same university? Would the score band width be sufficiently broad to make either of these tests a feasible placement tool? Those are some of the questions to be answered in future studies.

Concrete proposals

Test development is cyclical, not linear (McNamara, 2000, p. 23). That is, once a test is designed, constructed, trialled and operationalised its actual use generates evidence about its qualities (McNamara, 2000, p. 32). Since the J-NCT listening section was administered for the first time in 2006, to improve it for future years an ongoing test validation process is recommended. After all, from the TOEFL^® test's inception in 1976 to the TOEFL^® iBT today, *NUMBER* major progressions have been made. A procedure that incorporates consistent piloting, analysing and revising of items can only enhance a test's validity (Fulcher, 2000). In that light, the following four proposals are offered.

• Proposal 1: Widen response format

The J-NCT English listening section needs to have a wider response format which includes integrated and interactive test items rather than solely multiple-choice items. By incorporating a wider range of tasks and response formats, micro and macro level listening skills can be addressed.

• Proposal 2: Increase academic target language use

More tasks using academic content in the English listening section of the J-NCT would produce scores that more closely resemble the English listening language competencies required for the Japanese EFL university environment. In short, the target language use of the test needs to better reflect the intended target language use domain.

• Proposal 3: Pilot the J-NCT and delete items which perform poorly

By conducting piloting and/or pre-testing, the ID and IF levels could be raised. That is, by having a system by which the right statistical procedures are followed, items which "misfit" or perform poorly would automatically be deleted. A Rasch analysis could be employed to do this.

• Proposal 4: Adopt Bachman & Palmer's test usefulness framework

Use the test usefulness framework (Bachman & Palmer, 1996) as a rudimentary, theoretical basis to ensure that future versions of the J-NCT English listening section comply with each element: reliability, authenticity, interactiveness, practicality, test impact, and construct validity.
For example, by examining test impact, the degree to which the J-NCT English listening section's washback affects the individual and the Japanese educational system as a whole may be ascertained. In particular, it would be worth examining how this washback influences the curriculum being studied at Japanese senior high schools, since many high schools today appear to offer special preparation courses for the J-NCT.

References

Bachman, L. & Palmer, A. (1996). Language testing in practice. Oxford: Oxford University Press.

Brown, J. D. (1995a). Differences between norm-referenced and criterion-referenced Tests. In J. D. Brown and S. O. Yamashita (Eds.) Language Testing in Japan. pp. 12-19. Tokyo: Japan Association for Language Teaching.

Brown, J.D. (1995b). Developing norm-referenced language tests for program-level decision making. In J. D. Brown and S. O. Yamashita (Eds.) Language Testing in Japan. pp. 40-47. Tokyo: Japan Association for Language Teaching.

Buck, G. (2001). Assessing Listening. Cambridge: Cambridge University Press.

Butler, Y. G., & Iino, M. (2005). Current Japanese reforms in English language education: The 2003 "Action Plan". Language Policy, 4, 25-45.

Kokusaikyouiku koukann kyougikai (CIEE) Nihon Daihyobu. (2004). 2004 Nen TOEFL Tesuto (Test of English as a Foreign Language) Sukoa riyou jittaichousa houkokusho. [2004 TOEFL Test (Test of English as a Foreign Language) report using the TOEFL scores of universities in Japan]. Tokyo: Kokusaikyouiku koukann kyougikai (CIEE) Nihon Daihyobu TOEFL Jigyoubu.

Kokusaikyouiku koukann kyougikai (CIEE) Nihon Daihyobu. (2005). ETS TOEFL Internet Ban TOEFL Test Centre Boshuu Youkou. [ETS TOEFL Internet Based Test, Test Center Application Guidelines]. Tokyo: Kokusaikyouiku koukann kyougikai (CIEE) Nihon Daihyobu TOEFL Jigyoubu.

Chalhoub-Deville, M. & Turner, C. (2000). What to look for in ESL admission tests: Cambridge certificate exams, IELTS, and TOEFL. System, 28, 523-39.

Coombe, C., Kinney, J., & Canning, C. (1998). Issues in foreign and second language academic listening assessment. In C. A. Coombe (Ed.). Current Trends in English Language Testing: Conference Proceedings for CTELT 1997 and 1998. 27-36. Accessed November 20, 2006 at http://eric.ed.gov/ERICDocs/data/ericdocs2/content_storage_01/0000000b/80/11/61/bc.pdf.

ETS. (2005). ETS TOEFL test and score data summary: 2004-2005 test year data. Princeton, NJ: Educational Testing Service.

ETS. (2006). The Official guide to the new TOEFL^® iBT. New York: The McGraw-Hill Companies, Inc.

ETS. (2006). TOEFL Home. Accessed April 20, 2006 at http://www.ets.org/portal/site/ets/menuitem.fab2360b1645a1de9b3a0779f1751509/?vgnextoid=69c0197a484f4010VgnVCM10000022f95190RCRD.

ETS. (2006). TOEFL Program, Board, and Committee of Examiners. Accessed August 28, 2006 at http://www.ets.org/portal/site/ets/menuitem.1488512ecfd5b8849a77b13bc3921509/?vgnextoid=f9f9af5e44df4010VgnVCM10000022f95190RCRD&vgnextchannel=f03ad898c84f4010VgnVCM10000022f95190RCRD.

Fulcher, G. (2000). The 'communicative' legacy in language testing. System, 28, 483-497.

Heaton, J.B. (1990). Writing English language tests (2nd ed.). London: Longman.

Henning, G. (1987). A guide to language testing. Cambridge, MA: Newbury House.

McNamara, T.F. (1996). Measuring second language performance. London and New York: Addison Wesley Longman.

McNamara, T. (2000). Language testing. Oxford: Oxford University Press.

MEXT. (2003, March 31). Regarding the establishment of an action plan to cultivate "Japanese with English abilities." Accessed April 26, 2006 at http://www.mext.go.jp/english/topics/03072801.htm.

MEXT. (2005). Toukei [Statistics]. MEXT homepage. Accessed September 28, 2006 at http://www.mext.go.jp/b_menu/toukei/001/05122201/003.htm.

NCUEE. (2005). Heisei 17 Nendo Sentaa Shiken Heikinten ichiran. [NCUEE 2005 J-NCT Mean Scores]. Accessed April 20, 2006 at http://www.dnc.ac.jp/center_exam/17exam/17heikin.html.

NCUEE. (2006a). Guidelines in the 2005 school year. Accessed April 20, 2006 at http://www.dnc.ac.jp/dnc/gaiyou/pdf/youran_english_H17_HP.pdf.

NCUEE. (2006b). Sentaashiken jukenshasu, heikinten no suii (honshiken) heisei 18 nendo ikou. [Changes in number of examinees and average scores for the 2006 Center Examination.] Accessed June 15, 2006 at http://www.dnc.ac.jp/old_data/suii4.htm.

NCUEE. (2003, June 4). Heisei 18 nendo kara no daigaku nyuushi sentaashiken no shutsudai kyouka, kamoku nitsuite-saishuu matome. [Summary of the subjects covered in the 2006 University Entrance Centre Examination]. Accessed April 20, 2006 at http://www.dnc.ac.jp/center_exam/18kyouka-saishuu.html.

NCUEE. (2006d). Heisei 18 nendo Sentaashiken (honshiken) mondai, gaikokugo- risuningu & onsei mondai & sukuriputo. [Questions in the 2006 Center Examination: Foreign language-listening audio files & scripts]. Accessed April 20, 2006 from http://www.dnc.ac.jp/center_exam/18exam/18hon_mondai.html.

NCUEE. (2005, March 29). Heisei 17 nendo nendokeikaku. [2005 Annual Plan]. Accessed April 20, 2006 at http://www.dnc.ac.jp/dnc/gaiyou/nendo17.html.

NCUEE. (2006, July 28). Heisei 19 nendo Sentaa Shiken riyou daigaku. [Universities and colleges adopting the 2006 J-NCT]. Accessed September 28, 2006 at http://www.dnc.ac.jp/center_exam/19exam/riyou.html.

Popham, W. J. (1990). Modern educational measurement: a practitioner's perspective. 2nd ed. Boston, Mass: Allyn and Bacon.

Richards, J. (1983). Listening comprehension: Approach, design, procedure. TESOL Quarterly, (17) 2, 219 - 240.

Valette, R. M. (1977). Modern language testing. New York: Harcourt Brace Jovanovich.

Yomiuri Online. (2006). Sentaa shiken hatsu no risuningu. [The Centre Examination's First Listening Section.] Daily Yomiuri Online. Accessed April 20, 2006 at http://www.yomiuri.co.jp/kyoiku/news/20060130ur01.htm.
