Is appropriate appropriate? An investigation of interpersonal semantic stability

Proceedings of the 2nd Annual JALT Pan-SIG Conference. May 10-11, 2003. Kyoto, Japan: Kyoto Institute of Technology.

by H.P.L. Molloy Temple University Japan

Abstract

This paper presents an empirical evaluation of the rating-scale anchor word: "appropriate." A multidimensional scaling study and interpretation of descriptive statistics indicate that "appropriate" is an unstable concept among highly proficient English users and problematize the use of single words or concepts as anchors for rating scales in pragmatics research.

Keywords: appropriate, rating scales, multidimensional scaling, anchor words

Planning a study on the ability of Japanese university students to recognize appropriate advice, I hoped to use unmarked Likert-type scales anchored with the words "appropriate" and "inappropriate," anchors which are used fairly often in the literature (e.g., Hudson, Detmer, & Brown, 1992).
Unfortunately, "appropriate" is a word that does not seem to have been defined in the literature. In particular, no operational definitions seem to have been developed. Because I planned to use judges to rate student-participants' linguistic output for degree of appropriateness in my study of appropriate advice, it behooved me to try to develop a definition of "appropriate" to use in scoring guidelines.
I did not, for several reasons, wish to develop an operational definition on my own. First, I have no reason to believe that my reactions are similar to the reactions of most highly proficient English users. Second, there is no reason to believe that a definition developed by an applied linguist would be similar to one used by language users without training in applied linguistics; indeed, I tend to believe that it is probably a better idea to develop operational definitions from the reactions or linguistic behavior of naïve, untrained language users, as such are those most often encountered by language users. The reactions of non-linguists, that is, are arguably more important for language users than the reactions of linguists. I therefore decided to try to develop a definition empirically, from common features of advice compared with regard to appropriateness by highly proficient users of English. My goal was to develop a definition that would allow me to instruct judges to mark a piece of advice as appropriate if it contained or was marked by specific features.
At this first point in the exploration of the use of "appropriate," the intent was to see if the linguistic features of particular pieces of language, rather than the interaction between language and context, would yield consistent interpretations of the word "appropriate."
Had I been able to develop such an empirical definition of "appropriate," I could have been somewhat more confident that the raters' scoring would be akin to the judgments naïve language users make. Although precise operational definitions and training both lead to greater consistency among raters (see McNamara, 1996, for a review of the literature), consistency in the use of the word "appropriate" would not guarantee that the word was being used as it is in the wild, so to speak.
Research questions

What linguistic features do highly proficient users of English consider when judging the appropriateness of advice?

Does "appropriateness" mean the same thing to different people?

Can an empirical definition of "appropriateness" be developed?

Method

Operational definitions

For this paper, two operational definitions have import. "Japanese": participants who reported Japanese as their first language, whether or not they actually held a Japanese passport, were considered Japanese. "Highly proficient": participants who used English professionally, who seemed in my estimation to be able to use the language to convey a wide range of meanings, and who were not formally studying to improve their ability to use the language (as opposed to studying to master specific registers) were considered highly proficient. Participants were not tested formally.

Participants

Phase I: 113 first- or second-year Japanese university students of English from three different intact classes. Final N = 107.

Phase II: 122 first-year Japanese university students of English from four intact classes. Final N = 118.

Phase III: 113 first-year Japanese university students of English from five different intact classes. Final N = 113.

(Note: Phases I-III involved twelve different intact classes. No participant was included in more than one piloting phase.)

Phase IV: 28 highly proficient users of English. Final N = 18.

All student-participants were my own students; all non-student participants were friends or acquaintances of the researcher. They ranged from my closest friend to a few people I had only spoken to once before. Further information about the Phase IV participants is presented in Table 1. Note that the participant "Lars" is the researcher.

Materials

Phase I: Copies of a single A4-size data collection instrument printed on one side. Written instructions were in English only. The prompt was written to follow the data-collection procedures reported in Molloy and Shimura (2002). The instrument was distributed and completed during regular class meetings for most participants. Fewer than ten participants (who had been absent during the regular class meetings) filled out the instrument for homework. The instrument appears in Appendix 1 and includes an informed-consent agreement, which is also in English. This instrument was used to collect descriptions of situations in which student-participants actually had given advice.

Phase II: Copies of a single A4-size data collection instrument printed on both sides. Written instructions were included in English only. The instrument comprised a series of 13 items collected in Phase I, each with a five-point check-box-style Likert-type scale modeled after Dörnyei (2001). The points were labeled, from left, "Very unrealistic," "Unrealistic," "Average," "Realistic," and "Very realistic." The instrument and informed-consent agreement are reproduced in Appendix 2. This instrument was used to develop a pool of prompt items considered realistic by a large percentage of participants similar to the participants in Phase III.

Phase III: Copies of several two-sheet sets of double-sided A4-size sheets, with two prompts from Phase II on each side. Neither written instructions nor an informed-consent agreement was included; instructions were given orally and on the board; participants were instructed to indicate agreement to allow their answers to be used anonymously for research purposes by marking the data collection sheet with "OK." Participants were instructed not to write their names on the instrument. This instrument was used to collect a large number of examples of advice-giving for each of the most highly rated prompts from the Phase II study.

Phase IV: Thirty different 48-page A5-size booklets. Written instructions, minimally adapted from Schiffman, Reynolds, and Young (1981), were included on the cover page. The only adaptation I made was to change the nouns in the instructions from those used for the soft-drink marketing research by Schiffman and colleagues to those needed for this study. The researcher's mailing address was included on the back cover. There was one blank page.

The remaining 45 pages contained three items:

A sample single prompt randomly selected from Phase III.

How did you decide on your university? I want to study history, but my parents want me to go to a college of medicine. I'm in a dilemma: I want to meet my parents' expectations, but I'm interested in history. Which should I choose?

Two randomly selected responses to the prompt collected in Phase III.

A single scale comprising a single, continuous, unmarked line anchored by "Different" and "Same" on left and right, respectively.

Each response sheet carried two of the ten different responses that appear in Table 2, giving a total of 45 pairs. (It was assumed that a response compared with itself would be rated as 100% the same.) A sample page, hence, would include the prompt followed by two responses, such as these two, taken from Table 2:

You are really in very difficult situation. But, if I were you, I would choose what I'm interested in, because I think people cannot succeed in what they don't really hope.

I think it's better for you to study history because you might regret not having chosen to study it after you entered a college of medicine. I think it's important for you to decide on your own will.

Participants were to mark the scale below the two responses to show how much they believed the responses differed or were similar with regard to appropriateness. A mark nearer the word "Different" would mean the participant felt the two responses were very different with regard to appropriateness. A mark nearer the word "Same" would indicate a smaller degree of difference.
The order of pages was systematically (not randomly) varied so that each booklet had a different order of presentation.
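The booklet layout described above can be sketched in a few lines. The pair count follows directly from the text (ten responses taken two at a time); the rotation scheme used to give each booklet a different page order is only an assumption for illustration, since the paper does not state how the order was varied, and the function names are hypothetical.

```python
from itertools import combinations

def response_pairs(n_responses=10):
    """All unordered pairs of distinct responses (10 choose 2 = 45)."""
    return list(combinations(range(1, n_responses + 1), 2))

def booklet_orders(pairs, n_booklets=30):
    """Assumed scheme: rotate the page list by one position per booklet.

    The paper says only that the order was varied systematically,
    so this rotation is a stand-in, not the author's procedure.
    """
    return [pairs[k:] + pairs[:k] for k in range(n_booklets)]

pairs = response_pairs()
orders = booklet_orders(pairs)
print(len(pairs), len(orders))
```

Any scheme that yields thirty distinct permutations of the 45 pages would serve the same purpose.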
Procedure

Phase I: The printed data collection instrument was distributed to 113 participants. Five participants who did not wish to be included in the study or who did not fill out the informed consent agreement were eliminated from the study. One further participant was eliminated because it was not possible to remove identifying information from that person's response. (This last participant had some degree of personal fame for nonacademic endeavors and wrote about an incident having to do specifically with those activities.) Final participants numbered 107. Instruments were filled out during regular class sessions.
Data were entered into a computer in a single document. Data were reentered into a separate document with two modifications. First, grammar and spelling were regularized. This is a fairly common practice, used to prevent participants from focusing on mechanics, rather than message (Kasper, 1984, 2001; McDonald, 2000; Bardovi-Harlig & Dörnyei, 1998; Niezgoda & Röver, 2001). Second, details that could be used to identify participants were changed. If, for example, a participant named his or her home town, a different town, equidistant but in a different direction from Tokyo, was substituted. Other substitutions included changing hobbies or pastimes or references to places of work. Although it is possible that the situations as such could be used to identify participants (as with the second situation in Table 3), I did the Phase II data collection at universities different from those in Phase I, so that in no case was any situation shown to any participant who might be conversant with gossip that would allow identification of the persons described in the situations.
Situations involving family members were eliminated upon the reasoning that Japanese learners of English are not likely to communicate with their own family members in English.

Phase II: The printed data collection instrument was distributed to 122 participants. Four participants refused permission to use their responses for research purposes, reducing the final total to 118 participants.
Instruments were filled out during regular class sessions.
Data were entered into Microsoft Excel 2000 (1999) and later transferred to plain-text files. Information was then analyzed using the Winsteps item-response theory computer program (Linacre, 2002b) to try to detect poorly functioning items, which in this case would be prompts receiving inconsistent responses. After poorly functioning items were eliminated, data were examined using the multidimensional scaling computer program Permap (Heady & Lucas, 2002) to see if prompts were forming different clusters. Two items formed a cluster distinct from the remainder of the items and were eliminated.

Phase III: Six different printed data collection instruments, each containing four prompts, were distributed randomly to 113 participants. Instructions were given orally in English and Japanese and written on the blackboard in English. Informed consent was obtained from all participants. Further informed consent was obtained orally only from the ten participants whose responses were selected for the Phase IV instruments.
The instruments were filled out during regular class sessions.
Responses were stored in Microsoft Excel 2000 (1999). Using the RAND() function, I selected one prompt and ten responses for use in Phase IV.
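A random selection of this kind could be sketched as follows. This uses Python's random module as a stand-in for the spreadsheet RAND() function; the data structure, the function name, and the placeholder labels are hypothetical and are not taken from the study.

```python
import random

def select_materials(responses_by_prompt, n_responses=10, seed=None):
    """Pick one prompt at random, then n_responses of its responses.

    responses_by_prompt maps a prompt label to the list of responses
    collected for that prompt (an assumed layout of the stored data).
    """
    rng = random.Random(seed)
    prompt = rng.choice(sorted(responses_by_prompt))
    chosen = rng.sample(responses_by_prompt[prompt], n_responses)
    return prompt, chosen

# Placeholder labels only; these are not the Phase III data.
pool = {f"prompt_{p}": [f"p{p}_r{r}" for r in range(1, 21)] for p in range(1, 7)}
prompt, chosen = select_materials(pool, seed=1)
print(prompt, len(chosen))
```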

Phase IV: Data collection booklets were distributed to 30 participants. Eighteen of the 30 booklets had been returned as of the date of writing.
Following the recommendations of Schiffman, Reynolds, and Young (1981), I did not allow participants time to ask questions about the instructions and refused to give any explanations beyond the printed instructions. This was done to prevent instructions or explanations from biasing the results. One participant refused to participate because of my refusal to explain or answer questions. The final N was 18.
I, as the participant "Lars," filled out the instrument some five months after assembling the instruments and last reading any of the prompts. I did this, first, to see if my own judgments would be consistent and, second, to gain some insight into the difficulty of completing the instrument.
Participants were to indicate how similar or different two responses were, as described above.
The distance of each mark from the left of the scale was measured with a ruler (error: 0.5 mm) and divided by the length of the line, giving a measure of similarity. Marks made with circles were measured from the center of the circle; those with Xs from the intersection of the two lines; and those with check marks from the angle in the check mark; generally, I measured from the "center" of the mark. Marks made outside the line (atop the anchor words) were recorded as 0% or 100% of the line length. Data were stored in Microsoft Excel 2000 (1999) and later transferred to various plain text files. There was one missing mark. This was replaced with the mean of that participant's scores because the older Indscal (Carroll & Chang, 1970) program I used in analysis does not run with missing values.
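The scoring steps described above (distance from the left end divided by line length, marks outside the line treated as 0% or 100%, and a missing mark replaced by that participant's mean) can be sketched as below. The 100 mm line length and the sample measurements are assumptions for illustration; the paper does not report the length of the printed line.

```python
def mark_to_similarity(distance_mm, line_length_mm):
    """Convert a measured mark position to a 0-1 similarity score.

    The left anchor is "Different" and the right anchor is "Same",
    so a larger distance from the left means greater similarity.
    Marks outside the line are clamped to 0.0 or 1.0.
    """
    if distance_mm is None:
        return None  # missing mark, handled by impute_missing
    return min(1.0, max(0.0, distance_mm / line_length_mm))

def impute_missing(scores):
    """Replace missing scores with the mean of the observed scores."""
    observed = [s for s in scores if s is not None]
    participant_mean = sum(observed) / len(observed)
    return [participant_mean if s is None else s for s in scores]

# Hypothetical measurements in mm on an assumed 100 mm line.
raw = [25.0, 110.0, None, 75.0]
scores = impute_missing([mark_to_similarity(d, 100.0) for d in raw])
print(scores)
```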
Data were analyzed under the Alscal and Indscal multidimensional scaling algorithms using SPSS versions 9.01 (1998) and 10.0.1 (1999), Indscal (Carroll & Chang, 1970), and Past.exe version 0.97 (Hammer & Harper, 2002).
The Alscal algorithm is used to examine how stimuli are treated by groups; the Indscal algorithm is used to examine how stimuli are treated by individual members of a group. In terms of this study, I was looking both at how the different pieces of advice were treated and how the participants differed from one another regarding how they use "appropriateness" to judge pieces of advice.
Multidimensional scaling has several strong points for the kind of work I am trying to do. First, it does not necessitate collecting data from large numbers of participants. Fewer than fifteen participants are often reliably used. Second, multidimensional scaling arranges stimuli on arbitrary dimensions according to the distance between those stimuli. This feature has the benefit of (a) allowing easy visual interpretation, (b) permitting the researcher to determine what those dimensions represent (or do not, as will be shown below), and (c) enabling useful information to be extracted from simple judgments of similarity or difference. The procedure, then, does not necessitate the creation of a scale to measure some construct before using it. It is a scale-construction procedure, which is exactly what I wanted, as I did not know what criteria participants might be using to judge. Multidimensional scaling also works happily with ordinally scaled data. Finally, and most importantly conceptually, multidimensional scaling does not necessitate normally distributed data and does not have an assumption that the underlying construct is normally distributed.
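To make concrete what "arranging stimuli on dimensions according to the distances between them" means, here is a minimal sketch of classical (Torgerson) metric scaling in plain Python. This is not the Alscal or Indscal algorithm used in the study; it is a simpler relative, shown only to illustrate the idea, and the three-stimulus distance matrix is hypothetical.

```python
import math

def double_center(dist):
    """Turn a symmetric distance matrix into a centered inner-product matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def top_eigenpair(b, iterations=500):
    """Dominant eigenvalue and eigenvector of symmetric b by power iteration."""
    n = len(b)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * bv[i] for i in range(n)), v

def classical_mds(dist, n_dims=2):
    """Coordinates whose pairwise distances approximate dist."""
    n = len(dist)
    b = double_center(dist)
    coords = [[] for _ in range(n)]
    for _ in range(n_dims):
        value, vector = top_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i].append(vector[i] * scale)
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - value * vector[i] * vector[j] for j in range(n)]
             for i in range(n)]
    return coords

# Hypothetical distances among three stimuli (a 3-4-5 right triangle).
example = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
layout = classical_mds(example)
print(layout)
```

The recovered axes are arbitrary in orientation and sign, which is why the dimensions of a scaling solution have to be interpreted by the researcher rather than read off directly.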
Both two- and three-dimensional solutions were obtained for both algorithms. I chose to analyze and present only the two-dimensional solutions because stress (the measure of error, in this case measured by S-stress) did not decrease much between the two- and three-dimensional solutions and because the two-dimensional solutions would be easier to analyze.
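As a rough illustration of what a stress-type measure captures, the sketch below computes a badness-of-fit ratio on squared distances, in the spirit of S-stress. The exact formula and normalization used by the software are not given in this paper, so the denominator here is an assumption, and the raw dissimilarities stand in for the optimally scaled values a program such as Alscal would use.

```python
import math

def s_stress(config, dissim):
    """Squared-distance badness of fit: 0 means the configuration
    reproduces the dissimilarities exactly; larger means worse fit.

    config is a list of coordinate lists; dissim is a symmetric matrix.
    The normalization (sum of fourth powers of the dissimilarities)
    is one common choice and is assumed here.
    """
    numerator = 0.0
    denominator = 0.0
    n = len(config)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(config[i], config[j]))
            delta2 = dissim[i][j] ** 2
            numerator += (d2 - delta2) ** 2
            denominator += delta2 ** 2
    return math.sqrt(numerator / denominator)

# Hypothetical dissimilarities and two candidate configurations.
dissim = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
perfect = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
collapsed = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(s_stress(perfect, dissim), s_stress(collapsed, dissim))
```

Comparing this value across two- and three-dimensional configurations is the kind of check described above: if adding a dimension barely lowers it, the simpler solution is usually preferred.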
The results of the Alscal analysis were studied in two ways: by reproducing the configuration of data points from the two-dimensional solution using the ten response items and looking for conceptual patterns, and by correlating various measurable aspects of the responses with dimension scores. These measures included number of words, number of characters, the Flesch reading ease index, and the Flesch-Kincaid reading level (all checked with Microsoft Word 2000). This was done with an eye to extracting common elements that could be used for an empirical definition of "appropriate" (when dealing with advice). Further analysis was done intuitively, by arranging scraps of paper containing the responses on a large table in the layout shown in Figure 1. (The reader can reproduce this process by matching the responses in Table 2 with the configuration shown in Figure 1.)
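The correlational check can be sketched as a Pearson correlation between one text measure (word count, here) and the scores on one dimension. The paper does not say which correlation coefficient was used, so Pearson's r is an assumption, and the two example texts and dimension scores below are placeholders, not values from Table 2 or Table 3.

```python
import math

def word_count(text):
    """Number of whitespace-separated words in a response."""
    return len(text.split())

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Placeholder responses and dimension scores (not data from the study).
responses = ["Study history.", "I think you should talk to your parents first.",
             "If I were you, I would choose what interests me."]
dimension_1 = [-1.2, 0.4, 0.9]
lengths = [word_count(r) for r in responses]
print(lengths, pearson_r(lengths, dimension_1))
```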
Indscal results were examined for meaningful patterns. This was done to see if the participants formed clusters that could be described by, for example, age or first language.

Results

Since Phases I through III simply comprised a starting point for Phase IV, this paper will focus on the Phase IV results. The means and standard deviations for the 45 comparisons made by the 18 participants in the final phase of the study are shown in Table 4 and Table 5, respectively.
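The descriptive statistics referred to here are per-comparison means and standard deviations across participants. A minimal sketch follows, assuming each participant's scores are stored as a list in a shared pair order; the participant labels and values are placeholders rather than study data, and the use of the sample (n - 1) standard deviation is an assumption.

```python
from statistics import mean, stdev

def summarize_pairs(ratings):
    """Return (mean, sample SD) for each comparison across participants.

    ratings maps a participant label to a list of similarity scores,
    one per response pair, with every list in the same pair order.
    """
    by_pair = zip(*ratings.values())
    return [(mean(scores), stdev(scores)) for scores in by_pair]

# Placeholder scores for three hypothetical participants and two pairs.
ratings = {"P1": [0.2, 0.9], "P2": [0.4, 0.5], "P3": [0.6, 0.1]}
summary = summarize_pairs(ratings)
print(summary)
```

A standard deviation that is large relative to its mean for a given pair indicates that participants disagreed about how similar the two responses were in appropriateness.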
Results of the two-dimensional Alscal modeling solution are shown in Table 3 and Figure 1. These results show how the ten responses were treated by the 18 participants as a group. The results do not seem to be associated with any measurable properties of the stimuli. The relatively high S-stress values for all of the solutions (0.459 for the Indscal two-dimensional solution, for example) indicate a poor fit of the raw data to the model: an S-stress value of 0 would indicate perfect fit.
However, on conceptual examination, the Alscal configuration may be interpreted along two dimensions: the degree of emphasis on self-reliance in the response (in Figure 1, the x-axis, with a greater degree toward the left), and the degree of reference to the consequences of possible actions (in Figure 1, the y-axis, with a greater degree toward the bottom).
Results of the two-dimensional Indscal modeling solution are shown in Table 1 and Figure 2. These results show the relative weights the 18 participants gave to the two dimensions in the solution. Also included are sundry characteristics of the participants.

Discussion and recommendations

Among several points to be noted:

Standard deviations are generally very large compared with means, showing great variation in the application of "appropriateness."

Participants with low scores on both dimensions either apply judgment criteria inconsistently or apply criteria completely different from other participants' criteria.

Several intriguing results were obtained from this study: (a) some persons were more consistent than others; (b) "appropriateness" seemed to be judged according to two principal criteria: degree of emphasis on self-reliance, and degree of reference to consequences; (c) participants gave greatly different weights to these two criteria.
Result (c) is the most important. Figure 2 shows two distinct clusters of participants, but this seems to be an artifact of the particular and effectively arbitrary group of informants. Of the variables I am able to think of, none could be used to distinguish between the two clusters. These variables include first language, second language, academic specialty, age, gender, educational background, and amount of experience living outside Japan.
A key assumption of this study is that all Phase IV informants are competent in English and hence gave "correct" answers. I conclude that "appropriateness" is too unstable a concept across persons to be used for pragmalinguistic research with Likert-type or semantic differential scales, if the researcher's object is to obtain a picture of the behavior of naïve raters. The high S-stress values for both the Indscal and Alscal models are further indicative of the lack of agreement among participants regarding how they treated the notion of "appropriateness" in this study.
Note that this conclusion (c) is in accord with that of Benjamin (1992), who likewise concluded that any single given informant is not likely to be able to furnish reliable judgments.

Recommendations

Scales using descriptive anchor words should be avoided.

Traditional Likert scales may be preferable to semantic differential scales for judgment tasks.

Similarity of judgments across judges should not be assumed.

Similarity of judgments across stimuli should not be assumed.

Limitations and next steps

One of the principal limitations is the burden the Phase IV procedure puts on participants: one participant likened it to "psychological torture"; others commented on the difficulty of keeping a consistent set of standards in mind. When I myself tried the procedure I found this to be the case, even though I had developed the instrument and was well aware of how it was supposed to work.
A further burden is added by the time necessary to complete the instrument. Although I had developed it myself, it still took me 24 minutes to complete.
The difficulty of keeping a consistent set of standards in mind may well make participation difficult and stressful, but it may give credence to my tentative conclusion that "appropriateness" is not a workable word and cannot be satisfactorily operationally defined in such a way as to be consistent with actual use of the word.
Another major limitation is that all of the prompts were presented with no context whatsoever. Indeed, one potential participant refused to participate because of this lack, claiming, justifiably, that judgments of appropriateness are impossible to make without sufficient contextual information.
The other participants did manage to make appropriateness judgments, or at least make marks on the paper. I am not sure what this may mean: it could be considered to mean that people are able to make some decisions about appropriateness with little context. More likely, it may mean that participants are consciously or subconsciously supplying contexts in which to make the judgments. If such is the case, this study becomes worth much less with regard to conclusion (b).
This pilot study should be replicated in several different configurations:

The study should be replicated with a larger number of participants.

The study should be replicated with data generation procedures (such as ranking tasks) different from paired comparisons.

The study should be replicated using other potential or often-used anchor words, such as "polite."

References

Bardovi-Harlig, K., & Dörnyei, Z. (1998). Do language learners recognize pragmatic violations? Pragmatic versus grammatical awareness in instructed L2 learning. TESOL Quarterly, 32 (2), 233-262.

Benjamin, G. R. (1992). Perceptions of speaker's age in natural conversation in Japanese and English. Language Sciences, 14, 77-87.

Carroll, J. D., & Chang, J. J. (1970). Indscal. [Computer software]. Retrieved from the World Wide Web at http://www.netlib.org/mds/indscal.f on 20 August 2002.

Dörnyei, Z. (2001). Teaching and researching motivation. Harlow, England: Pearson Education.

Hammer, Ø., & Harper, D. A. T. (2002). P A S T: PAlaeontological STatistics. [Computer software]. Retrieved from the World Wide Web at http://folk.uio.no/ohammer/past/ on 3 Nov. 2003.

Heady, R. B., & Lucas, J. L. (2002). Permap. [Computer software]. Retrieved from the World Wide Web at http://www.ucs.louisiana.edu/~rbh8900/permap.html on 16 August 2002.

Hudson, T., Detmer, E., & Brown, J. (1992). A framework for testing cross-cultural pragmatics. (Technical report #2). Honolulu: University of Hawai'i, Second Language Teaching and Curriculum Center.

Kasper, G. (1984). Pragmatic comprehension in learner-native speaker discourse. Language Learning, 34 (4), 1-20.

Kasper, G. (2001). Four perspectives on L2 pragmatic development. Applied Linguistics, 22, 502-530.

Linacre, J. M. (2002b). Winsteps [Computer software]. Chicago: Author.

McDonald, J. L. (2000). Grammaticality judgments in a second language: Influences of age of acquisition and native language. Applied Psycholinguistics, 21, 395-423.

McNamara, T. (1996). Measuring second language performance. London: Longman.

Microsoft Excel 2000. [Computer software]. (1999). Redmond, WA: Microsoft.

Molloy, H. P. L., & Shimura, M. (2002, September). Production and recognition difference in Japanese university students' English-language complaining. Paper presented at the JACET 41st Annual Convention, Tokyo.

Niezgoda, K., & Röver, C. (2001). Pragmatic and grammatical awareness: A function of the learning environment. In K. R. Rose & G. Kasper (Eds.), Pragmatics and language teaching (pp. 63-79). Cambridge: Cambridge University Press.

Schiffman, S. S., Reynolds, M. L., & Young, F. W. (1981). Introduction to multidimensional scaling: Theory, methods, and applications. New York: Academic Press.

SPSS version 9.01. [Computer software]. (1998). Chicago: SPSS.

SPSS version 10.0.1. [Computer software]. (1999). Chicago: SPSS.
