An investigation of method effects on reading comprehension test performance
| 
Keywords: reading comprehension, language testing, text structure, test method effects
The main purpose of this research was to investigate the effects of factors other than language ability on reading comprehension test performance. The two main variables were text organization and response format. In addition, as a brief follow-up, the paper examines the reading passages used in some university entrance examinations in Japan in order to relate the findings to actual testing practice.
The theoretical framework for this research draws firstly on Bachman's (1990) model of language ability and test method facets, and secondly on Meyer's (1975, 1985) model of prose analysis. Kintsch and Yarbrough's (1982) study also helped operationalize the response format variable.
Bachman (1990) presents a model of language ability, later modified in Bachman and Palmer (1996). He includes 'test method facets' in his discussion of language ability and draws attention to a range of factors which can affect test performance. Bachman and Palmer (1996: 62) posit the importance of method facets, which they now term 'task characteristics', as follows:

    Language use involves complex and multiple interactions among the various individual characteristics of language users, on the one hand, and between these characteristics and the characteristics of the language use or testing situation, on the other. Because of the complexity of these interactions, we believe that language ability must be considered within an interactional framework of language use.
Bachman classifies test method facets into five categories: 1) testing environment; 2) test rubrics; 3) the nature of the input; 4) the nature of the expected response; and 5) the interaction between the input and the response. According to Bachman, these factors can affect test performance; it is important for testers to be aware of their influences and, if possible, minimize them. This study focuses on the third and fourth of these facets. 'The nature of the input' (the materials presented to test takers) was chosen as the main variable of this study because reading materials are a very important factor in reading comprehension tests. Background knowledge, for example, is a well-researched area (see for example Alderson and Urquhart 1983, 1985a, 1985b; Bernhardt 1991; Carrell and Eisterhold 1983; Clapham 1996; Johnson 1981, 1982; Mohammed and Swales 1984; Salager-Meyer 1991; Steffensen and Joag-Dev 1984; Steffensen et al. 1979; Ulijn and Strother 1990). Of the various factors, text organization, especially rhetorical organization, was chosen for this investigation. This decision was based on an extensive literature survey of previous studies on text characteristics and readability (e.g. a series of studies by Beck and colleagues 1982, 1984, 1989, 1991, 1995; Britton et al. 1989; Davison and Kantor 1982; Duffy and Kabance 1982; Duffy et al. 1989; Graves et al. 1988, 1991; Klare 1985; Olsen and Johnson 1989; Reder and Anderson 1980; Urquhart 1984).
| ". . . five types of top-level relationships are thought to represent patterns in the way we think . . . " | 
Text organization is a difficult concept to express in concrete terms. After exploring ways of operationalizing it, Meyer's model of prose analysis was adopted (Meyer 1975, 1985). In Meyer's content structure analysis, idea units are organized in a hierarchical manner on the basis of their rhetorical relationships. The rhetorical relation at the highest level in the hierarchy is called the top-level rhetorical organization, and this characterizes the text. The top-level rhetorical structure is identified as one of the following: 'collection', 'causation', 'response', 'description' or 'comparison' (Meyer later renamed 'response' as 'problem-solution'; the latter term is used in this study). These five types of top-level relationships are thought to represent patterns in the way we think (Meyer 1985: 20).
The link between the ideas is weakest in 'collection', where ideas are loosely associated with each other around a common topic. 'Time sequence' is a sub-type of 'collection', used for example when recounting events in chronological order. In the 'causation' relation, the ideas are related both in terms of time (i.e. one event happens before another) and causality (i.e. the earlier event causes the later one). Finally, the 'response' ('problem-solution') relation involves still more inter-relationship between the ideas, in that a solution is suggested in response to the existing causality.
 'Comparison' and 'description' are on a different plane from the others because they are based on a hierarchy or 
subordination of ideas. In a 'description' relation, ideas are arranged in a hierarchical manner: "one argument is 
superordinate and the other modifies this superordinate argument" (Meyer 1985: 20). The 'comparison' relation has at 
least two subordinate arguments which are linked by an element of comparison. This means that there is more interlinking 
in the 'comparison' relation than in the 'description' relation.
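To make the hierarchy concrete, the following sketch represents a content structure as a small Python tree. The idea units and labels are invented for illustration (loosely echoing the 'sea safety' topic used later in this study), and the notation is mine rather than Meyer's.

    # A hypothetical content structure: idea units nest under rhetorical
    # relations, and the relation at the root of the hierarchy is the
    # text's top-level organization. All labels are invented examples.
    problem_solution = {
        "relation": "problem-solution",
        "children": [
            {"relation": "causation",                    # the problem and its cause
             "children": [
                 {"idea": "tankers run aground near the coast"},
                 {"idea": "crews sail without local pilots"},
             ]},
            {"idea": "require certified pilots in coastal waters"},  # the solution
        ],
    }

    def top_level(node):
        """Return the relation at the highest level of the hierarchy."""
        return node.get("relation", "description")

    print(top_level(problem_solution))   # -> problem-solution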
Based on a number of empirical studies (Meyer and Freedle 1984; Meyer et al. 1980, 1993), Meyer claims that ideas are more easily remembered when presented in tightly-organized texts because of the close links between the ideas. Her claim has been supported by other researchers (e.g. Carrell 1984; Goh 1990; McGee 1982; Richgels et al. 1987). This study builds on these findings and explores their applicability in foreign language reading comprehension tests. After exploring this issue in a preliminary study, four text types were chosen: 'association', 'description', 'causation' and 'problem-solution'.
| ". . . open-ended questions . . . [are] more effective in measuring the understanding of main ideas of . . . [a] text whereas cloze tests only . . . [touch] upon local understanding . . . " | 
 Another variable to be investigated was response format. Meyer and her associates used 'recall', that is, asking students 
to reproduce what they have read, as a way of measuring reading comprehension. But recall is not a common format in 
second language testing. Therefore, it was considered more worthwhile to examine more typical test formats.  There are 
a number of research studies on the effects of test format on test performance (e.g. Graesser et al. 1980; Graves et al. 
1991; Kintsch and Yarbrough 1982; Lewkowicz 1983; Reder and Anderson 1980; Shohamy 1984; Shohamy and Inbar 1991). Among 
others, Kintsch and Yarbrough's (1982) study proved inspirational. They investigated the effects of two test formats on 
reading comprehension: open-ended questions and cloze tests. They found that open-ended questions were more effective in 
measuring the understanding of main ideas of the text whereas cloze tests only touched upon local understanding and did 
not reflect the reader's overall understanding. Since text organization was the primary focus of this study, their 
findings were particularly relevant and it was decided to adopt their approach.
 After a pilot study and further reading, this study included another test format: summary writing. Summary writing seemed 
to be even more sensitive to overall understanding than open-ended questions (Bensoussan and Kreidler 1990). The problem 
with open-ended questions is that different types of questions require varying levels of reading skills and varying 
amounts of information in the reading passage. If questions are asked about minor details or are related to local 
understanding, they do not require the reader to grasp the meaning of the whole text. Even though open-ended questions 
can touch upon the main themes of a text, questions normally prompt the reader to focus on specific ideas in the text. 
On the other hand, to write a summary, the reader needs to be able to distinguish the main ideas from minor details and 
to identify the macrostructure of a text. This seemed to suggest that the text structure involved in the passages could 
have a strong impact on performance in summary writing.
 In addition to the two main variables, learners' language proficiency level was selected as a third variable: the results 
of the pilot test had suggested that the impact of this factor would also be worth exploring. The participants were 
therefore divided into three groups according to their results on a short English proficiency test described below.
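By way of illustration, one plausible way to form three such groups is to apply tertile cut-offs to the proficiency test scores. The following is a minimal sketch under that assumption, not a record of the exact procedure used.

    # Sketch: divide participants into three proficiency bands using
    # tertile cut-offs on the proficiency test. Scores are hypothetical;
    # the study's actual cut-offs are not reported here.
    def proficiency_bands(scores):
        ranked = sorted(scores)
        n = len(ranked)
        low_cut, high_cut = ranked[n // 3], ranked[2 * n // 3]
        def band(s):
            if s < low_cut:
                return "low"
            return "high" if s >= high_cut else "mid"
        return [band(s) for s in scores]

    print(proficiency_bands([23, 35, 41, 29, 18, 44, 31]))
    # -> ['low', 'high', 'high', 'mid', 'low', 'high', 'mid']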
The research questions asked whether text organization and response format affect reading comprehension test performance, and whether any such effects vary with learners' language proficiency level.
The research methods were as follows. In total, 735 Japanese university students participated. The majority were 18 or 19 years of age and in their first or second year. These students were mainly at lower-intermediate to intermediate levels of English proficiency. They were randomly divided into twelve groups to cater for the variables, with each student receiving one of a selection of 12 reading comprehension tests (see below).
Two tests were administered to the students:

1. A 50-item English proficiency test, mainly based on grammar and vocabulary
The purposes of the test were to establish the comparability of the twelve student groups and to identify three proficiency groups as a basis for comparison at a later stage of the study. The test statistics were:
Mean = 29.7 out of 50; SD = 8.07; reliability (alpha) = .82; facility values ranged from .17 to .99 with a mean of .59; item-total correlations ranged from .08 to .53 with a mean of .34.
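For readers who wish to compute such statistics themselves, the following Python sketch shows one way to derive them from a 0/1 response matrix. The data are randomly generated placeholders, and the corrected item-total correlation shown is one common variant; the study's own analyses were run in SPSS/PC (see below).

    # Sketch: item statistics from a 0/1 response matrix
    # (rows = test takers, columns = the 50 items). Placeholder data.
    import numpy as np

    rng = np.random.default_rng(0)
    responses = rng.binomial(1, 0.6, size=(735, 50))

    totals = responses.sum(axis=1)
    facility = responses.mean(axis=0)      # proportion answering each item correctly

    # Item-total correlation, corrected by excluding the item from the total.
    item_total = [np.corrcoef(responses[:, i], totals - responses[:, i])[0, 1]
                  for i in range(responses.shape[1])]

    # Cronbach's alpha for internal-consistency reliability.
    k = responses.shape[1]
    alpha = k / (k - 1) * (1 - responses.var(axis=0, ddof=1).sum() / totals.var(ddof=1))

    print(f"mean = {totals.mean():.1f}, SD = {totals.std(ddof=1):.2f}, alpha = {alpha:.2f}")
    print(f"facility: {facility.min():.2f}-{facility.max():.2f}")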
2. A variety of reading comprehension tests
The texts used in the study were specially prepared to maximise control over the variables identified in the pilot study. On the basis of expert judgement regarding their suitability as representative samples of the selected text types, two sets of texts concerning 'international aid' and 'sea safety' were finally selected for use in the study. The mean length of the texts was 369.3 words (range: 352-384), and the mean score was 64.4 (range: 58.5-69.9) on the Flesch Reading Ease formula, one of the most widely recognised readability indices. After the two sets of four texts had been selected, test items were developed for each text in three formats: cloze, open-ended questions, and summary writing.
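The Flesch Reading Ease score itself is straightforward to compute: 206.835 minus 1.015 times the average sentence length in words, minus 84.6 times the average number of syllables per word. A minimal Python sketch follows; the syllable counter is a crude vowel-group heuristic that real implementations improve upon.

    # Flesch Reading Ease:
    #   206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
    import re

    def count_syllables(word):
        # Rough heuristic: count groups of adjacent vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text):
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        return (206.835 - 1.015 * (len(words) / sentences)
                        - 84.6 * (syllables / len(words)))

    sample = "Aid projects often fail. Donors and recipients disagree about goals."
    print(round(flesch_reading_ease(sample), 1))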
The number of items for each test was 25 for the cloze test, 5 for the open-ended question format, and 10 for summary writing. Two response formats, open-ended questions and summary writing, were set in Japanese, the students' first language, to eliminate undesirable effects of the use of English on reading performance.
Every effort was made to maximize the comparability across the eight texts. To achieve this, extensive use was made of expert judgements (see below). For example, for the cloze tests, the deletion rate (every 13th word) was decided on the basis of the results of the pilot study, and the starting points for deletion were decided after extensive analysis of the nature and types of potential cloze items (see Appendix 1).
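As a sketch of the mechanics, a fixed-ratio cloze of this kind can be generated as follows. The starting point and deletion rate are parameters; the item analysis that guided the actual choices is not reproduced here.

    # Sketch: build a fixed-ratio cloze test by blanking every nth word
    # from a chosen starting point (the study deleted every 13th word).
    def make_cloze(text, n=13, start=13):
        words = text.split()
        answers = []
        for i in range(start - 1, len(words), n):
            answers.append(words[i])
            words[i] = f"({len(answers)}) ______"
        return " ".join(words), answers

    # Stand-in passage; one of the 369-word texts would be used in practice.
    passage = " ".join(f"word{i}" for i in range(1, 40))
    cloze_text, answer_key = make_cloze(passage)
    print(answer_key)   # -> ['word13', 'word26', 'word39']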
Ideally, all the students would have received all the versions to facilitate comparison of test performance. However, this approach had two limitations, one of practicality and one of validity. First, it was impractical for the students to take all 24 tests, considering the amount of time required. Secondly, the validity of the research would have been undermined if the students had been given all 24 texts, because they would have read a set of eight texts three times. Shohamy (1984) questions the validity of the study by Samson (1983), who compared three test formats by allowing the participants to take all the versions based on the same passage. Furthermore, in my study, the four texts within each topic were fairly similar, varying only in text structure. This would have caused a similar problem, arising from familiarity effects.
Therefore, a matrix sampling procedure was adopted in which each student was given only two of the 24 test versions. Each student would receive one text from each topic, 'international aid' and 'sea safety'. Both of these texts would be of the same text type and in one of the three test formats. This meant that there would be 12 participant groups, each taking a different set of test versions, varying in text type and response format. Table 1 below summarises the 12 participant groups. For example, Group 2 would take a cloze test with two 'causation' texts while Group 9 would write summaries of two 'association' texts. A one-way ANOVA confirmed that there was no significant difference among the twelve groups in their English language proficiency (F(11, 723) = 0.39, n.s.).

 Furthermore, to eliminate an order effect, the order of the two texts was counterbalanced in each set of tests. This 
resulted in 24 different sets of test booklets: two sets of twelve different tests. The test booklets, 24 
different versions, were arranged so that each version would be randomly distributed among the students. In this way, 
the students were randomly divided into twelve groups.
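The overall design can be summarised programmatically. The sketch below enumerates the 12 groups and 24 booklet versions; the ordering of formats and text types is an assumption, chosen so that the numbering is consistent with the two examples given above (Group 2: cloze with 'causation' texts; Group 9: summaries of 'association' texts).

    # Sketch of the design: 4 text types x 3 response formats = 12 groups;
    # counterbalancing the order of the two topic texts = 24 booklets.
    import itertools
    import random

    text_types = ["association", "causation", "description", "problem-solution"]
    formats = ["cloze", "open-ended", "summary"]
    topics = ("international aid", "sea safety")

    groups = list(itertools.product(formats, text_types))        # 12 groups
    booklets = [(fmt, ttype, order)
                for fmt, ttype in groups
                for order in (topics, topics[::-1])]             # 24 versions

    students = list(range(735))                                  # placeholder IDs
    random.shuffle(students)
    assignment = {s: booklets[i % len(booklets)] for i, s in enumerate(students)}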
A number of expert judges were invited to assist at different stages of this study, from text selection and item analysis to establishing marker reliability. For example, several judges were asked to analyse test items in detail in order to maximise the comparability across the eight texts.

Using SPSS/PC, descriptive statistics were calculated. Furthermore, ANOVAs (both one-way and two-way analyses of variance) were conducted to test the research hypotheses. The significance level was set at p < .05. To assess the reliability of the researcher's marking, 15% of the papers for the open-ended questions and the summary writing task were independently marked by two other expert judges. The correlation coefficients among the raters ranged from .85 to .92. These were deemed satisfactory for the purpose.
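For replication purposes, the same analyses can be run outside SPSS. The sketch below uses Python with pandas and statsmodels; the file name and column names are hypothetical.

    # Sketch of the reported analyses in Python rather than SPSS/PC.
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Hypothetical file with columns: score, text_type, fmt, rater1, rater2.
    df = pd.read_csv("scores.csv")

    # Two-way ANOVA: main effects of text type and response format,
    # plus their interaction, tested at p < .05.
    model = ols("score ~ C(text_type) * C(fmt)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))

    # Inter-rater reliability as a Pearson correlation between markers.
    print(df["rater1"].corr(df["rater2"]))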
Figure 1 below shows the students' mean scores on the reading tests for the four different types of text structure and the three types of response format.

| ". . . reading comprehension is assessed through open-ended questions, it does not matter what kind of text structure is involved as long as there is some kind of structure . . ." | 
 The figure shows that, in the cloze tests, the mean scores were highest in 'association' texts and lowest in 
'problem-solution' texts. In other words, comprehension performance as measured by the cloze format was better in 
loosely-organized texts and became poorer as the text structure became tighter. On the other hand, in open-ended 
questions and summary writing, the students' mean scores were lowest in 'association' texts, the most loosely 
organized texts. While the highest scores for open-ended questions were in 'description' texts, for summary writing 
the highest scores were in 'causation' texts. More generally, for the summary writing response format the two most 
tightly-organized texts ('causation' and 'problem-solution' texts) produced the highest mean scores, whereas for 
the open-ended response format equally high values were observed in three text types ('description', 'causation', 
and 'problem-solution' texts). This may suggest that when reading comprehension is assessed through open-ended questions, 
it does not matter what kind of text structure is involved as long as there is some kind of structure. There seems to be 
a clear distinction between cloze tests and the other two formats in their interaction with types of text organization. 
This difference was statistically significant (see Appendix 2). In other words, it can 
be claimed that test performance is affected by text type and response format.
More interestingly, the two-way interaction between the two effects proved to be statistically significant (F(6, 723) = 6.149, p < .005). This means that text type and response format not only had significant effects on reading comprehension separately, but also interacted with each other.
It is interesting to find that the presence of clear text structure did not help reading comprehension performance in cloze tests, and perhaps even hindered it. No other studies have been conducted in this area, so it is difficult to explain this pattern. It may be related to the density of information: tightly-organized texts may compress more distinct ideas into a limited space so as to include all the elements needed to develop an argument, and therefore may contain more new words (see Kintsch and Keenan 1973). As the frequency of a word's recurrence in a text seems to be one of the factors affecting cloze item difficulty (see Kobayashi 2002b), this is an interesting area to explore further.
When the results were examined in terms of the learners' language proficiency level, more striking results emerged. Figures 2-4 below show the mean scores of the three proficiency groups.
[Figure 3. Open-Ended Questions results (%) by proficiency levels.]
[Figure 4. Summary Writing results (%) by proficiency levels.]
 Overall the effects of tighter text organization were more apparent with higher proficiency learners, notably when the 
open-ended questions and summary writing were used as the response format. By contrast, the performance of less proficient 
students showed little variation according to text type or test format. This was again statistically confirmed (see Appendix 2).
From this finding, it can be posited that, in open-ended questions and summary writing, the impact of different kinds of text organization varies considerably across proficiency groups. When texts with looser structures were used, the reading comprehension measured by these response formats did not correspond to general language proficiency as closely as when more tightly-organized texts were used. This seems to suggest that, in these test formats, students of higher proficiency could be unfairly disadvantaged and their proficiency may not be reflected accurately in test performance if less structured passages are presented.
This research has employed Bachman's influential model of language ability and test method facets as an organizing framework. The findings of this study have provided data to support two aspects of his model: the effect of the nature of the input and the nature of the expected response on reading comprehension. More research needs to be conducted in this area so that the findings reported here can be illuminated further, but it seems that the main implications of this study for language testing and second language research are clear.
 Very often, test results are used as evidence for making important decisions.  For example, test results may be used to 
decide whether a student should be admitted to university, whether a prospective employee should be hired, or whether a 
project should continue or not. This study has clearly demonstrated that there is a systematic relationship between the 
students' test performance and the two variables examined. Therefore, it is extremely important for language testers, 
or anyone who makes judgements on the basis of test results, to pay attention to the test methods used when they produce 
their assessment instruments or interpret test scores.
As a brief follow-up, this research has recently been extended to investigate the text structures involved in actual reading passages used for university entrance examinations in Japan. Reading passages from the Centre Examinations for the past seven years were examined: 28 passages altogether, four for each year (Questions 4 and 6 of both the main exam and the additional exam for those who could not take the first one). Half of the 28 passages were narrative, mostly heart-warming stories with some moral message. These passages did not have any clear text structure, apart from a loose time sequence. The other half were expository texts involving charts or tables. The analysis revealed that the vast majority of these texts had 'description' as the main text organization, and none had 'causation' or 'problem-solution'. As discussed earlier, 'description' is not tightly organized compared with 'causation' or 'problem-solution'. This lack of clear structure in many of the actual reading passages used for the Centre Examinations seems to present a problem, as the more proficient students could be disadvantaged.
 The types of questions which appeared in the Centre Examinations are a further source of concern. Kobayashi (1995, 2004b) 
discovered that local level questions tended to have poor discrimination between students. It is therefore worrying that 
the vast majority of the questions in the Centre Examinations seemed to require only a small amount of context. It is 
also a problem that some questions only required the ability to understand a chart or table, not the comprehension of 
the content of a passage.
Of course, this investigation is exploratory and limited in scope, but its findings point to important practical problems with the Centre Examination. This is a high-stakes examination, and it is therefore essential that it be well designed. This research has identified a number of issues which should be taken into account in future reviews of the examination.