Making Sense in Voice Interfaces:
Adaptation, grounding, and other means for getting the message across.

 

Eva Jettmar

Dept. of Communication

Stanford University

 

Unpublished Manuscript

March, 1998


 

Introduction  

Recent advances in speech recognition and text-to-speech (TTS) applications have allowed such systems to become ubiquitous in everyday software products. Based on the idea that people subconsciously view mediated voices as social actors (Nass & Reeves, 1996), it can be assumed that users will prefer computer-generated voices and types of vocal interactions that are as similar as possible to human voices and human-human interactions.

Yet in current systems, voices still sound flat and far from natural, and planned vocal interactions between a computer and a user appear staged, cumbersome, and devoid of emotions. The design of current systems appears to be based on technological affordances rather than psychological principles. This might not be a fruitful approach to designing truly satisfying systems.

Since humans are experts in human-human interaction, but novices in human-computer interaction, voice interfaces which follow "natural" communication rules and resemble human-human interaction might provide a step towards more satisfying systems. Therefore, guidelines for designing realistic synthesized voice interactions should be based on knowledge about humans' vocal interactions.

When two people talk to each other, what strategies do they use to make sure they understand each other? Specifically, which factors in vocal interaction determine whether or not understanding is achieved? Consider the following example (from: Clark & Brennan, 1991):

       Alan:     Now, -um, do you have a j- car
       Barbara:  -have a car?
       Alan:     yeah
       Barbara:  no-.

Here, it is obvious that several strategies are employed to ensure understanding, learn about possible misunderstandings, and repair mistakes. However, it is not clear which cues specifically are used by the human interactants in order to accomplish this. Findings from Neurobiology, Psychology, and Speech and Nonverbal Communication may help shed light on a topic which has not yet been formally acknowledged by designers of voice interaction systems.
 

1. Neurobiology

Knowledge about how complex sounds are perceived and processed by humans might help to make sure that synthesized voice provides exactly the cues which are most important for human listeners. While voice is perceived as a "Gestalt", multiple cues are processed by humans. Moore (1995) notes that human speech has an extraordinary degree of complexity in its spectral richness and time variations, and humans are capable of detecting minute differences in sound patterns. Vowels in the range of 80-400 Hz and modulations of voice pitch are easily identified by humans, and frequencies alone provide sufficient information for correct identification when only a few vowels are present (as in short utterances), while fine timing is used as an additional cue to decode longer and more complex utterances. Differences of only 2 Hz are sufficient for the human brain to distinguish between different voices. The other main cues used to distinguish between different voice types (normal, vocal fry, falsetto, breathy, whisper, harsh) are the baseline frequency, frequency range, vibration (timbre), aspiration, and location of source.

For voice applications, this suggests that naturalistic expression of vowel frequencies is paramount, that more attention should be given to correct fine timing of voice output when sentences are long and complex, and that different "actors" could be used. One reason why it is extremely difficult to reproduce realistic speech sounds is the failure of speech to meet the conditions of linearity and invariance (Chomsky and Miller, cited in Mullenix and Pisoni, 1989), and its "smearing" of acoustic information without segmenting information corresponding to linguistic units. Yet it is exactly these features that let people recognize speech immediately and that prompt appropriate processing.

Numerous studies (cited in Mullenix and Pisoni, 1989) suggest that categorical perception of speech with intimate links between production and perception is a robust phenomenon. The brain's speech perception mode is radically distinct from perception of non-speech (e.g., visual perception mode). Lyon and Waengler (1976), for example, found that human decoding of speech is a different mental process from decoding of written language. They found that decoding spoken language involves "internalized speech", producing a reduced and condensed version of what was said in order to grasp the content. This process might be analogous to internal visualization of written information. These findings suggest that identical text might be processed differently in TTS and written output. Similarly, Mullenix and Pisoni (1989) argue that speech is processed radically differently from other sounds in that higher-level sources of information (phonological, lexical, syntactic, semantic) interact with and affect lower-level perception.

In addition, Debus (1978) found that sounds are immediately judged by humans in terms of their emotional valence, and that these judgments are used for further processing of the sounds, which might work as positive or negative reinforcement. The valence of sounds can also determine the level of activation caused in the brain.

In an experiment about the effects of producing sounds of a specific valence, Hatfield, Hsee, and Costello (1995) found that the production of sounds of a certain valence and nature (e.g., joyful, angry, sad) put subjects in the same emotional state, even if they did not know which mood was contained in the sounds they were asked to express. For voice systems, this means that it might be possible to affect the user's emotional state by prompting him/her for input that suggests a certain emotional valence and nature (e.g., prompting the user to express his/her answers in a formal way might make the user feel more serious about the interaction).

Evidence also exists that when processing voice input, permissible phonological sequences and permissible lexical content affect perception (Massaro and Cohen, cited in Mullenix and Pisoni, 1989). Perception of speech involves complex interactions with higher-order knowledge concerning the linguistic structure of language, as well as interactions with long-term memory representations of linguistic entities (Mullenix and Pisoni, 1989). For TTS technologies, this means that even if the voice quality is not perfect, the use of common phonological sequences can suggest fluent natural language to the user, and if some words cannot be pronounced clearly, users will be able to identify the intended meaning.
 

2. Nonverbal and Interpersonal Communication

 While neurobiological approaches are mainly concerned with cognitive processing of verbal messages, research in interpersonal and nonverbal communication focuses on the coding and decoding of expressive messages humans use in order to reach understanding and on communicative goals in conversation.

Research on the verbal aspects of regulating behaviors and message construction provides helpful insights for the design of TTS systems. This part, however, addresses nonverbal communication, whose importance for conversational control seems to have been underrated. A focus on nonverbal cues may seem counterintuitive for voice interaction. However, it has to be remembered that vocalics (nonverbal voice utterances, pitch modulations, and fine timing) are part of nonverbal communication, and typical voice interactions include kinesic aspects such as pointing.

Nonverbal behavior is believed to account for as much as two-thirds of the meaning in any social interaction in which such cues are present (Burgoon, 1985), and lack of eye contact, for example, can prevent interactants from receiving relevant social information (Andersen & Coussoule, 1980). Perceptions of nonverbal backchannel behaviors are positively related to perceived understanding (Stelzner et al., 1994), and the level of nonverbal involvement can provide an important indicator of the quality of a relationship  (Patterson, 1988).

When comparing verbal and nonverbal communication, it seems evident that verbal communication is best suited for conveying complex information, while purely nonverbal communication is best suited for conveying emotional messages (Andersen, 1985). However, typically, both systems are employed simultaneously in human interaction to complement each other and provide redundancy.

Redundant cues have been demonstrated to be important contributors to accurate decoding processes (DePaulo & Rosenthal, 1988): When cues from multiple channels were consistent, subjects in experiments were markedly more accurate in decoding emotional cues in messages (Rosenthal et al., 1979, cited in DePaulo & Rosenthal, 1988); however, discrepant cues are perceived as extremely disturbing and might lead to inaccurate decoding, or to perceptions of the speaker as being insincere and untruthful (DePaulo & Rosenthal, 1988).

It has been demonstrated that when both verbal and nonverbal cues (such as vocalics, body posture, or facial expressions) are present, nonverbal cues are believed over verbal cues by the message recipient (DePaulo & Rosenthal, 1988). The reason for this is that it is more difficult for humans to control their nonverbal behaviors, so these behaviors are taken to be sincere expressions (DePaulo & Rosenthal, 1988). Similarly, when both video and audio channels are present, information from visual channels is believed over auditory information ("seeing is believing!"). Experiments have shown that the most important information in conversations is inferred from facial expressions, with the smile being the most important cue (DePaulo & Rosenthal, 1988).

For TTS systems, this means that for desktop applications, it might be helpful to provide a face in addition to the voice; if this is not possible, satisfactory communication is still possible (as in telephone conversations), but it seems important that nonverbal voice characteristics, such as vocalic utterances and pitch modulations, are added to complement the verbal expression of the system. In addition, nonverbal cues are used to perform conversational control functions such as regulating turn-taking and indicating success in conveying information. Research shows that the majority of regulating behaviors in communication is nonverbal (e.g., headnods).

These cues define the meaning of the verbal cues and thus serve as contextual cues. It has been demonstrated that monitoring a conversational partner's nonverbal expressions is a continuous process (Rosenfeld, 1985). Backchannel cues, such as "m-hm" and "u-hu", are important nonlinguistic indicators of whether or not a message has been understood by the recipient, whether or not the receiver agrees with the statement, and of attentiveness. These cues also indicate that the receiver wishes to stay in the listener role (Rosenfeld, 1985).

Since minimally meaningful units of conversational information are structured in units larger than a single word (Goldman-Eisler, 1972, cited in Rosenfeld, 1985), the problem of determining individual units is solved mainly by strategic pausing, a strategy which is also used to indicate desired speaker switching. Similarly, an emphasizing stress called a "phonemic clause" is used approximately every 14 words to elicit backchanneling behaviors from the listener. Phonemic clauses with falling pitch absolutely require a response by the listener; failure to respond has been observed to be extremely bothersome to the speaker, since it indicates the listener's inattention or disagreement (Rosenfeld, 1985).

For voice technology, this means that it is of utmost importance to equip synthetic speech with the capacity to employ strategies such as pausing and phonemic clauses in meaningful ways. In addition, the system should allow for the user's expression of backchannel cues, and should itself provide backchannel cues when the voice recognition system detects the expression of a phonemic clause by the user. The user's utterance style could also be used to detect when the user would like to switch speaker-listener roles, and this information could be used by the system to streamline the conversation.
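
One way to read this recommendation is the following minimal sketch in Python. It assumes a speech recognizer that reports, for each stretch of user speech, whether it ended in falling pitch and how long the silence after it lasted. The UserSegment type, the function name, and both thresholds are hypothetical illustrations and are not taken from the reviewed literature.

```python
from dataclasses import dataclass

# Hypothetical thresholds (in seconds); real values would need empirical tuning.
BACKCHANNEL_PAUSE_S = 0.2   # short pause after a stressed, falling-pitch clause
TURN_YIELD_PAUSE_S = 1.0    # long pause taken as "the user is done speaking"


@dataclass
class UserSegment:
    """One stretch of user speech, as an assumed recognizer might report it."""
    text: str
    falling_pitch: bool       # assumed output of a prosody detector
    trailing_pause_s: float   # silence following the segment


def respond_to_segment(seg: UserSegment) -> str:
    """Decide how the system reacts: 'take_turn', 'backchannel', or 'listen'."""
    # A long pause is treated as a request to switch speaker-listener roles.
    if seg.trailing_pause_s >= TURN_YIELD_PAUSE_S:
        return "take_turn"
    # Falling pitch plus a brief pause is treated as a cue that a
    # backchannel response (e.g. "m-hm") is expected from the listener.
    if seg.falling_pitch and seg.trailing_pause_s >= BACKCHANNEL_PAUSE_S:
        return "backchannel"
    # Otherwise the system stays silent and keeps listening.
    return "listen"
```

In this sketch the long-pause check comes first, so that a clear turn-yielding signal is never answered with a mere "m-hm".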

Numerous studies also indicate that gesture is typically closely coordinated with gaze and speech and complements speech in face-to-face conversations (Streek, 1993). Gesture and speech are viewed as complementary components of a single process of utterance, and gestures are described as integral symbolic components of communication processes, while gaze and gaze shifts can serve as markers of the meaning or communicative relevance of a gesture (Kendon, 1980, cited in Streek, 1993). Gestures, in addition, "prepare the scene" for further talk and action, and it is interesting to note that they typically occur slightly before the speech act that they refer to.

Research on telephone conversations, which might be more similar to human-computer voice interaction, indicates that nonverbal cues such as eye gaze, hand gestures, and head nods, which are usually closely tied in with vocalic strategies, are frequently replaced with vocal cues, and that the interactants use a different set of turn-taking and other control strategies than in face-to-face interactions (Dittman and Liewellyn, 1967, cited in Rosenfeld, 1985). For example, listeners wait for the definite end of a sentence ("final juncture") before they start their reply. This, again, points to the importance of nonlinguistic vocalic cues to control communication. Another way to analyze the kinds of interactions that different media afford stems from theories such as Social Presence Theory (Short, Williams, & Christie, 1976), Media Richness Theory (Draft, Lengel, & Trevino, 1987), and the "cues-filtered-out" approach (Culnan & Markus, 1987), which have previously been used to describe the differences between face-to-face and mediated communication settings.

According to these theories, synthesized speech without vocalics would be considered an extremely "lean" medium with low social presence and most cues "filtered out". Early studies found communication in such media to be more task-oriented, less emotional, and less personal (e.g. Hiltz, Johnson, & Turoff, 1986) than face-to-face interactions. Later, however, studies found that humans seem to be able to adapt their communication style to restricted media environments over time, so that comparable, though not equal, levels of communication satisfaction are achieved. It has to be noted, however, that communicative processes such as impression formation (see Ruscher, 1996) were still found to be slowed in restricted media (Walther, 1993).

For voice technology, these findings could mean that users should be given time to adapt to using a new system, and it should be expected that it might take longer to reach understanding in voice-driven human-computer interaction. In addition, this suggests that as long as vocalics are not present in voice output, voice output would be best suited for task-oriented communication, as well as for stereotypical content, since impression formation is faster and easier for stereotypical content (Ruscher, 1996).
 

3. Psychology

In addition, the question of how people define common ground in communication is an important one to address in the context of voice technology in order to determine what strategies could be used by voice systems to avoid misunderstandings, "dead ends", awkward situations, and frustration. When people communicate, they adapt to each other. Laver (1993) found that humans continually adapt the prosodic (pitch, tempo) and structural aspects of their speech to the perceived needs of their listeners, and he proposes principles for incorporating intelligent adaptivity in the operation of text-to-speech conversion systems to improve communication with human partners.

Similarly, Gregory (1986) found that paralinguistic adaptation in communication is so strong, even for voice frequency alone, that a computer is able to pick out separated communication dyads from a multitude of voices just by looking at the adaptation in conversation. In addition, results from a study comparing human-human (HH) and human-computer (HC) spoken dialogues (Johnston, 1995) show that the only significant differences between HH and HC speech were in the areas of turn-taking (discussed in part 2) and grounding.

Grounding in Communication

According to Clark and Brennan (1991), the process of grounding refers to establishing common lexical ground that is good enough for current purposes between the interactants in conversation. What, then, does it take to constructively contribute to conversation? How can people make sure that they are talking about the same thing, and how can they make sure they are being understood? In short, how do people get the message across?

For one, a basic requirement for people to contribute to discourse is that they add to the "common ground" in an orderly way: For each utterance, the mutual belief that the addressee has understood the content should precede further utterances (Clark, 1989). This process usually starts with the speaker's presentation of a noun phrase, which is then repaired, expanded on, or replaced in an iterative process until a mutually accepted version is jointly found (Clark, 1986). Thus, two phases can be distinguished in the grounding process: the presentation phase, and the acceptance phase, in which all the participants work together to establish the mutual belief that everyone has understood the presented stretch of speech well enough for current purposes (Clark, 1987).

It is important to mention that problems of participants in conversation are really joint problems and have to be managed jointly through prevention of foreseeable but avoidable problems, warning partners of such problems, and repairing problems that have arisen. These strategies can be employed at each of the levels of decoding and encoding: (1) words, (2) stretches of speech, and (3) meaning. For computers, this suggests that in addition to repair strategies (discussed below), strategies to avoid errors should be employed, such as rephrasing words which sound very much like other words.
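
The error-avoidance strategy of rephrasing easily confused words could be sketched as follows in Python. The substitution table is a hypothetical example built on the "free"/"complimentary" pair mentioned in the design rules of this paper; a real system would need a table derived from the actual confusions of its speech synthesizer, and the function name is likewise an assumption.

```python
import re

# Hypothetical table: word that is easily misheard -> safer synonym.
# "free" (confusable with "three") is the example used in this paper.
SAFER_SYNONYMS = {"free": "complimentary"}


def rephrase_confusable(text: str, synonyms: dict = SAFER_SYNONYMS) -> str:
    """Replace easily misheard words in system output before it is spoken."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word  # word is not in the table: leave it unchanged
        # Preserve sentence-initial capitalization of the original word.
        return replacement.capitalize() if word[0].isupper() else replacement

    # Visit each run of letters/apostrophes; punctuation is left untouched.
    return re.sub(r"[A-Za-z']+", swap, text)
```

For instance, rephrase_confusable("Parking is free.") yields "Parking is complimentary.", so the listener cannot mishear the word as "three".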

A side-product of this joint effort is that when people in conversation refer repeatedly to the same object, they come to use the same lexical terms to describe the object. This phenomenon, called lexical entrainment, helps establish a shared conceptualization or "conceptual pact" which can be referred to in later references (Brennan & Clark, 1996). Such pacts might be important for voice interfaces: once a system and a user have agreed upon a commonly understood term, the system should remember that term and use it in the future. Clark and Brennan (1991) also discuss several principles that define the grounding process:
 

a. The Principle of least collaborative Effort

Interactants usually try to spend the least combined effort on the grounding process; for example, people correct errors the partner made, and correction of errors requires cooperation. An example of this would be:

  Eva: And then I had to help -um... what's-his-name... ?
  Cliff: Byron?
  Eva: Yes, Byron. I had to help Byron.

 

b. The Contribution Model

Interactants contribute to the grounding process by giving negative ("excuse me?") and positive ("gosh", "I see") evidence, by employing strategies such as repetition of core facts, using adjacency pairs (e.g. question-answer), and by using turn-taking skills.
 

c. Grounding References

In addition to repetitions, alternative descriptions (e.g., "my car" and "the Mustang"), indicative gestures (e.g., pointing), referential installments (e.g. "open the can - the small one in the corner - the one that says 'Greek olives'"), and trial references (e.g. "I took my car... - see the one in the corner?" - "yeah" - "... to campus today...") serve as grounding references.
 

d. Grounding changes with the Medium

How grounding is accomplished changes dramatically with the medium, as media vary on many dimensions that affect grounding (Clark and Brennan, 1991).

The dimensions that affect the choice of grounding strategies are: copresence of interactants, visibility and audibility of partner, cotemporality (interacting in real time), simultaneity (being able to send and receive simultaneously), sequentiality (A's and B's turns cannot get out of sequence), reviewability of partner's message, and revisability of own message before it is sent.

It has been established that the "costs" of grounding interact with the medium. For voice technologies, possible relevant costs that should be minimized are:

1. Startup costs - since defining initial common ground in voice media requires large investments of time and effort, user profiles of existing users or customers should be used to avoid the same investments in subsequent interactions.

2. Display costs - since pointing is not possible in voice interaction, software might provide diagrams for complex content, and alternate grounding strategies should replace pointing to avoid errors.

3. Fault and repair costs - repairing errors that have been made (e.g., clarifying misunderstandings) requires a lot of effort; therefore, errors should be corrected as soon as they occur, since they tend to snowball (Clark and Brennan, 1991). This means that a voice interface should offer easy-to-use, comprehensive 'help' and other backup strategies which users could use to correct errors. In addition, the system should respond to users' utterances such as "what?", "no!", and "dish, not fish!".

These inferences for voice interaction design lead to general design implications that can be drawn from the reviewed literature.
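
How a system might react to such repair utterances can be illustrated with a small Python sketch. It assumes that the recognizer delivers the user's utterance as plain text; the action names, the signal table, and the function name are hypothetical, and the pattern only covers corrections of the simple form "X, not Y".

```python
import re
from typing import Optional, Tuple

# Hypothetical mapping from short repair signals to system actions.
REPAIR_SIGNALS = {
    "what": "repeat_or_rephrase",   # user did not understand the system
    "no": "undo_and_reconfirm",     # system misunderstood the user
}

# Matches corrections such as "dish, not fish!": (intended), not (misheard).
CORRECTION = re.compile(r"^\s*(\w+),?\s+not\s+(\w+)\W*$", re.IGNORECASE)


def classify_repair(utterance: str) -> Optional[Tuple[str, ...]]:
    """Return a repair action for the utterance, or None if it is not a repair."""
    match = CORRECTION.match(utterance)
    if match:
        intended, misheard = match.group(1).lower(), match.group(2).lower()
        # Ask the dialogue manager to replace the misheard word.
        return ("replace_word", misheard, intended)
    # Strip punctuation and case so "What?" and "what" are treated alike.
    key = utterance.strip().strip("?!. ").lower()
    action = REPAIR_SIGNALS.get(key)
    return (action,) if action else None
```

Handling these signals immediately, rather than at the end of a dialogue, follows the observation above that uncorrected errors tend to snowball.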
 

4. Technical Feasibility of Voice Adaptation in Interfaces

Examples from research in artificial intelligence and adaptive agent technology show that even though the creation of highly adaptive voice interfaces seems technically challenging, it is feasible even by current standards. Systems have been proposed, for example, that can classify "illocutions", or speech acts, into the categories of assertives, directives, commissives, permissives, prohibitives, and declaratives, that can furthermore take into account the state of the "world" that is commanded or promised, and that combine this information to create speech output that suggests "communication competence" (Spitzberg, 1992), i.e., that is appropriate and effective (Singh, 1998). In addition, such systems can use measures of the user's satisfaction with the speech act in order to fine-tune voice output and optimize user satisfaction.

Several existing AI logic principles can readily be used by the system to determine what to say and how to say it (Singh, 1998). Currently, systems exist in which agents utilize voice pitch which is appropriate for the context and meaning of a speech act (e.g. strategically raising the voice, stressing important words, and pausing in accordance with the content). At the same time, these systems employ animated agents which use hand gestures and facial expressions which are synchronized with both the speech output and the content of the message.

For example, a square is drawn in the air to emphasize the word "bank check", since it is the most important word in the sentence; a hand gesture in the appropriate direction is used for "over there" (Cassell et al., 1998). Systems like these have already accomplished the important task of synchronizing text with natural speech and nonverbal cues, and they demonstrate the feasibility of truly adaptive voice interfaces.

5. Conclusion

The social-science findings from this review, together with the knowledge that creating intelligent, responsive voice systems is technically feasible, translate easily into implications for the design of voice interfaces which are more natural and which employ human, rather than technical, rules for communication. Such interfaces should be easier and more pleasant to use, since the interaction they afford would strongly resemble natural human language and discourse. The final section of this paper will therefore be devoted to a simple, hands-on expression of design implications for voice-based systems which can be readily used by designers of such systems.
 

Twelve Design Rules for Voice Interfaces

1. Get the sound frequency right - humans are too good at detecting you got it wrong, and they don't like voices that sound "wrong".

2. It's ok to use several "actors" - humans are very good at distinguishing different voices, even if they are similar.

3. When presenting complex content, use correct fine timing to make the message clear.

4. Be aware that speech may be processed differently than written text, so you might have to change the text for the spoken version.

5. When prompting the user for voice input, be aware that what the user is prompted to say will affect his/her mood.

6. Add vocalics - nonverbal vocal elements such as appropriate pitch modulations and u-hu sounds - as well as strategic pausing, phonemic clauses, and backchannel cues to the voice output whenever appropriate, and possibly even a face.

7. Be consistent among the cues you provide - e.g., don't mix serious content with playful voices.

8. Give users time to adapt to using your system: Have a fun-to-use tutorial ready.

9. Be aware that it may take longer to reach understanding in voice interaction, so plan for repetitions and similar strategies, if requested by the user.

10. Use synonyms for words that cannot be expressed unambiguously (e.g. "three" and "free" - use "complimentary" instead of "free").

11. Keep a log of each user's conceptual pacts with the system that have already been established, and use the structures which have been established in order to avoid repeated "startup costs".

12. React to the user's "u-hu" (use it as a turn-taking cue), "no!" (signals that the system misunderstood the user), "wait!" (signals that the user needs time), "what?" (signals that the user did not understand or is confused), and similar utterances.
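
As an illustration of rule 11, a log of conceptual pacts could be as simple as the following Python sketch. The class name, the method names, and the idea of keying pacts by a user identifier and an internal referent identifier are assumptions made for this example only.

```python
from collections import defaultdict


class PactLog:
    """Per-user record of the terms a user and the system have agreed upon."""

    def __init__(self) -> None:
        # user_id -> {referent_id: agreed term}
        self._pacts = defaultdict(dict)

    def agree(self, user_id: str, referent: str, term: str) -> None:
        """Store the term that user and system settled on for a referent."""
        self._pacts[user_id][referent] = term

    def term_for(self, user_id: str, referent: str, default: str) -> str:
        """Return the agreed term, or the system's default wording if none exists."""
        # .get() avoids creating empty entries for users without pacts.
        return self._pacts.get(user_id, {}).get(referent, default)


# Usage: once a user has called her car "the Mustang", the system reuses that term.
log = PactLog()
log.agree("eva", "car_1", "the Mustang")
prompt = "Shall I schedule service for " + log.term_for("eva", "car_1", "your car") + "?"
```

Because pacts are stored per user, a term agreed upon with one user is not imposed on another, who would still hear the default wording.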
 

 

References

 Andersen, P.A. (1985). Nonverbal immediacy in interpersonal communication. In A.W. Siegman & S. Feldstein (Eds.) Multichannel Integrations of Nonverbal Behavior (pp. 1-36). Hillsdale, NJ: Lawrence Erlbaum.
 Andersen, P.A., & Coussoule, A.R. (1980).  The perceptual world of the communication apprehensive: The effect of communication apprehension and interpersonal gaze on interpersonal perception.  Communication Quarterly, 28, 44-54.

 Brennan, S., & Clark, H. (1996). Conceptual Pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory, and Cognition vol. 22 (6), 1482-1493.

 Burgoon, J.K.  (1985).  Nonverbal signals. In M.L. Knapp & G.R. Miller (Eds.),  Handbook on interpersonal communication. (pp. 344-392).  Beverly Hills, CA:  Sage.

 Cassell, J. et al. (1998). Animated Conversation: Rule-Based generation of Facial Expression, gesture, and Spoken Intonation for Multiple Conversational Agents. In: Huhns & Singh (eds.). Readings in Agents. San Francisco, CA: Morgan Kaufmann Publishers.

 Clark, H. (1994). Managing Problems in Speaking. Speech Communication vol 15 (3-4), 243-250.

 Clark, H. (1989). Contributing to Discourse. Cognitive Science vol. 13 (2), 259-294.

 Clark, H. (1986). Referring as a collaborative Process. Cognition, vol 22 (1), 1-39.

 Clark, H. (1987). Collaborating on contributions to conversations. Language and Cognitive Processes vol 2 (1), 19-41.

 Clark, H. & Brennan, S. (1991). Grounding in Communication. In L.B. Resnick, J.M. Levine, and S.D. Teasley (eds), Perspectives on socially shared cognition (127-149). Washington, DC: APA.

 Culnan, M.J. & Markus, M.L. (1987) Information technologies. In F.M. Jablin, L.L. Putnam, K.H. Roberts, & L.W. Porter (Eds.), Handbook of organizational communication: An interdisciplinary perspective (pp. 420-443). Newbury Park, CA: Sage.

 Debus, G. (1978). Über Wirkungen akustischer Reize mit unterschiedlicher emotionaler Valenz. Meisenheim, Germany: Hain.

 DePaulo, B., & Rosenthal, R. (1988).  Ambivalence, Discrepancy and Deception in Nonverbal Communication.

 Draft, R.L., Lengel, R.H., & Trevino, L.K.  (1987).  Message equivocality, media selection, and manager performance: Implications for information systems. MIS Quarterly, 11, 355-366.

 Gregory, S. (1986). Social Psychological Implications of voice frequency correlations: Analyzing conversation partner adaptation by computer. Social Psychology Quarterly vol 49 (3), 237-246.

 Hatfield, E., Hsee, C., and Costello, J. (1995).  The Impact of Vocal Feedback on Emotional Experience and Expression. Journal of Social Behavior and Personality, vol. 10, no. 2, 239-312.

 Hiltz, S.R., Johnson, K., & Turoff, M. (1986).  Experiments in group decision making: Communication process and outcome in face to face versus computerized conferences.  Human Communication Research, 13, 225-252.

 Johnston, A. (1995). There was a long pause: Influencing turn-taking behaviour in human-human and human-computer spoken dialogues. International Journal of Human-Computer Studies vol 42 (4), 383-411.

 Laver, J. (1993). Repetition and re-start strategies for prosody in text-to-speech conversion systems. Speech Communication vol 13 (1-2), 75-85.

 Lea, M., & Spears, R. (1995). Love at first byte? Building personal relationships over computer networks. In: Wood, J. & Duck, S. Understanding relationships: Off the beaten track. (pp. 197-233). Thousand Oaks, CA: SAGE.

 Lyon, G. & Waengler, H. (1976). Covert Articulations in Adult Listeners. Hamburg, Germany: Buske.

 Malandro, L.A., & Barker, L.  (1983).  Nonverbal Communication.  Menlo Park, CA: Addison-Wesley.

 Moore, B.C. (ed.) (1995). Hearing. Cambridge, GB: Academic Press.

 Mullenix, J. D. & Pisoni, D. B. (1989). Speech Perception: Analysis of biologically significant signals. In: R. Dooling and S. Hulse (eds.). The Comparative Psychology of Audition: Perceiving Complex Sounds. Hillsdale, NJ: Lawrence Erlbaum Associates.

 Patterson, M.L.  (1988).  Functions of nonverbal behavior in close relationships.  In S. Duck (Ed.) Handbook of personal relationships: theory, research and interventions (pp. 41-56). New York: Wiley.

 Rosenfeld, H. (1985). Conversational Control Functions of Nonverbal Behavior.

 Ruscher, J. (1996). Forming Shared Impressions through Conversation: An adaptation of the Continuum Model. Personality and Social Psychology Bulletin vol. 22 (7), 705-720.

 Short, J., Williams, E, & Christie, B.  (1976).  The social psychology of telecommunications. London: Wiley.

 Singh, M. (1998). A Semantics for Speech Acts. In: Huhns & Singh (eds.). Readings in Agents. San Francisco, CA: Morgan Kaufmann Publishers.

 Stelzner, M.A., Egland, K.L., Andersen, P.A. & Spitzberg, B. (1994, November).  Perceived understanding, nonverbal communication, and relational satisfaction. Paper presented at the annual convention of the Speech Communication Association, New Orleans, LA.

 Streek, J. (1993). Gesture as Communication: Its Coordination with Gaze and Speech. Communication Monographs, vol.60, pp. 275-297.

 Walther, J.B.  (1993).  Impression development in computer mediated interaction.  Western Journal of Communication, 57, 381-398.

2004