Chapter Seven
The Voice
Chapters 5 and 6 were about social cues that engage the eyes. This chapter is devoted to the ear. Chapter 7 completes the discussion of characters' basic social equipment with discussion of the voice—the rich messages that people convey through how they say things.1 The chapter includes an overview of the kinds of social cues that the voice conveys, with many listenable examples from games (including Warcraft III: Reign of Chaos, Final Fantasy X, The Sims™, Grim Fandango, and Curse of Monkey Island), and offers design tips for considering the aural side of social signals when crafting character voices. Chapter 7 also includes discussion of some future-facing voice technology and an interview with two pioneers in using emotion detection from voice cues to adjust interfaces.
Before reading this section, take a moment to listen to the first two voice samples on the DVD (Clips 7.1 and 7.2). While listening to each person, try to form a mental picture: How old are they? What gender? Is this person of high or low status? Are they in a good or bad mood? Then see Section 7.9 for photos of the speakers. Most likely you correctly identified the majority of these visible traits from voice alone.
Listening to a person's voice on the telephone, you can often make a good guess about age, gender, social status, mood, and other characteristics without any visual cues to help. Even if the person is speaking another language and you cannot understand the meaning of the speech, you can still get pretty far in assessing these qualities. How is this possible?
Researchers point to the evolutionary roots of speech in the grunts and calls of our primate ancestors. There are striking similarities in the vocal characteristics of fright, anger, and dominance, among other social cues, when one compares primate and human voices. Researchers who asked participants in a study to listen to male macaque monkeys found that more than 80% of the listeners could accurately identify what the dominance calls meant (Tusing and Dillard 2000, 149).

1 Analyzing the social meaning of what characters say moves into the territory of linguistics,
Social scientists refer to the information that is not conveyed by the words in speech as paralinguistic cues. A large proportion of the meaning in everyday conversation emerges through paralinguistic cues—shifts in voice quality while speaking, pauses, grunts, and other nonlinguistic utterances. Paralinguistic cues play an even bigger role in communication between people who already know each other well—a well-placed sigh or lack of a heartfelt tone conveys volumes.

To make characters seem richly human in their communication, then, a designer should have a solid understanding of what they are conveying with how they say things.
7.2.1 The Mechanics of Speech
To speak, a person pushes air from the lungs through the larynx, mouth, and nose. The pitch and the qualities of sounds are affected in two different ways: phonation and articulation. Phonation is the way a person moves the larynx itself to make the initial sound. When the shape of the muscle folds in the larynx (which used to be called the vocal cords) is altered, it produces different sound pitches (called fundamental frequencies) and also different sound qualities, such as breathiness or harshness. These qualities can shift due to a person's emotional state—tenseness, tiredness, depression, and excitement all can have effects on phonation. Articulation is when a person uses the natural resonance of the mouth, nose, and even of the chest cavity, as well as moving the tongue and lips and palate, to alter the sound as it comes out. People are very sensitive to shifts in articulation—for example, a person can "hear" a smile in another's voice, in part because the shift in lip shape when speaking affects the articulation of the sound (see Figure 7.1). Listen to audio Clips 7.3 and 7.4 on the DVD. Can you tell which recording was made while smiling? See Section 7.9 for the answer.
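Fundamental frequency can also be measured directly from recorded audio, which is one way a game or tool could track the pitch contour of a voice. The sketch below is a minimal illustration of a common approach (autocorrelation over a short frame), written in Python with NumPy; the sample rate, pitch bounds, and voicing threshold are illustrative assumptions, not values from this chapter.

    import numpy as np

    def estimate_f0(frame, sample_rate=16000, fmin=75.0, fmax=400.0):
        """Rough fundamental-frequency estimate for one short audio frame,
        using autocorrelation. Returns None if the frame seems unvoiced."""
        frame = np.asarray(frame, dtype=float)
        frame = frame - np.mean(frame)            # remove DC offset
        corr = np.correlate(frame, frame, mode="full")
        corr = corr[len(corr) // 2:]              # keep non-negative lags

        if corr[0] <= 1e-12:                      # silent frame
            return None

        # Search only lags that correspond to plausible speech pitch.
        min_lag = int(sample_rate / fmax)
        max_lag = int(sample_rate / fmin)
        if max_lag >= len(corr):
            return None
        best_lag = min_lag + int(np.argmax(corr[min_lag:max_lag]))

        # Weak periodicity: treat the frame as unvoiced (noise, fricatives).
        if corr[best_lag] < 0.3 * corr[0]:
            return None
        return sample_rate / best_lag

Tracking this estimate frame by frame yields the pitch contour that, together with loudness and timing, carries much of the social and emotional signal described in the rest of this chapter.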
As with facial expression and gesture, some of what people hear in others' voices comes from their physical qualities and their body's involuntary reactions to circumstances. Some comes from learned strategies and responses to social circumstances. For example, gender and age come across in voice because of physical qualities of the person's vocal equipment itself (which can be a problem for people whose voices fall outside the usual range for their gender or age group). Mood and emotion are signalled involuntarily (at least in part) because of changes in vocal production as the person's nervous system reacts—for example, the dry mouth and speedier heart rate of anxiety also have effects on the muscles in the larynx and on breathing itself. However, a person can also mold the tone of his or her voice in some ways, adopting a pacifying, pleading, arrogant, or neutral tone of voice using intonation and rhythm (referred to as prosody by researchers). Failing to adopt the proper social tone of voice is a communication in and of itself.
7.2.2 The Social Signals in Voice
Emotion in Voice
Emotions underpin decision-making, including social action and reaction (see [Damasio 1994] for a fascinating account of the role of emotions in thinking). Knowing that another person is angry is crucial to understanding how they are interpreting social actions and the world at large, thus helping to predict what they might do next. Failure to recognize emotional expression is a serious liability in human interaction—it is in fact a symptom of some disorders in the autism spectrum.
FIGURE 7.1 Both phonation—the action of the larynx (vocal cords)—and articulation—the shaping of the mouth, tongue, and lips—create the subtle alterations in tone that carry social and emotional information (based on Kappas, Hess, and Scherer 1991).

Chapter 5 touched upon Ekman and colleagues' work on recognizing facial expressions of emotion. There has also been extensive work on the expression of emotion in the voice. Voice researchers have found, when they look for consistent signatures of emotions, that there are clear patterns (see [Kappas, Hess, and Scherer 1991; Cahn 1990; and Burkhardt and Sendlmeier 2000] for more detail on the taxonomy that follows; a rough code sketch of this mapping appears just after the list):
• Anger (hot): Tense voice, faster speech rate, higher pitch, broader pitch range.
• Anger (cold): Tense voice, faster speech rate, higher fundamental frequency and intensity, tendency toward downward-directed intonation contours.
• Joy: Faster speech rate, raised pitch, broader pitch range, rising pitch pattern.
• Fear: Raised pitch, faster speech rate, broadened range, high-frequency energy.
• Boredom: Slower speech rate, additional lengthening of stressed syllables, lowered pitch, reduced pitch range and variability.
• Sadness (crying despair): Slower speech rate, raised pitch, narrowed pitch range, narrowed variability.
• Sadness (quiet sorrow): Slower speech rate, lowered pitch, narrower pitch range, narrower variability, downward-directed contours, lower mean intensity, less precision of articulation.
• Depression: Lower intensity and dynamic range, downward contours.
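For designers working with parametric speech synthesis or runtime voice processing, the taxonomy above can be captured as a small lookup table that biases a neutral voice toward an emotion. This is a hypothetical sketch rather than a published model: only the direction of each adjustment (faster or slower, higher or lower, wider or narrower) follows the research summarized above, and the specific multipliers are placeholders to be tuned by ear.

    # Directional prosody adjustments per emotion, relative to a neutral voice.
    # "rate" is a speech-rate multiplier, "pitch" a fundamental-frequency
    # multiplier, and "range" a pitch-range multiplier. Values are placeholders.
    EMOTION_PROSODY = {
        "anger_hot":   {"rate": 1.3,  "pitch": 1.2,  "range": 1.4},
        "anger_cold":  {"rate": 1.2,  "pitch": 1.1,  "range": 0.9},
        "joy":         {"rate": 1.2,  "pitch": 1.15, "range": 1.3},
        "fear":        {"rate": 1.3,  "pitch": 1.25, "range": 1.3},
        "boredom":     {"rate": 0.8,  "pitch": 0.9,  "range": 0.7},
        "sad_despair": {"rate": 0.8,  "pitch": 1.1,  "range": 0.75},
        "sad_quiet":   {"rate": 0.75, "pitch": 0.85, "range": 0.7},
        "depression":  {"rate": 0.8,  "pitch": 0.9,  "range": 0.6},
    }

    def prosody_for(emotion, intensity=1.0):
        """Blend an emotion's adjustments toward neutral by intensity (0..1)."""
        neutral = {"rate": 1.0, "pitch": 1.0, "range": 1.0}
        base = EMOTION_PROSODY.get(emotion, neutral)
        return {k: 1.0 + (v - 1.0) * intensity for k, v in base.items()}

A synthesizer, or a real-time pitch and time-stretch effect applied to recorded lines, could scale its neutral settings by these factors, with the intensity value driven by game state.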
Notice the similarities among emotions—fear, anger, and joy all seem to be signalled by faster speech, higher pitch, and more range. In contrast, quiet sadness, depression, and boredom share slowing of pace, lower pitch, and less variability. These effects can be traced back to what is going on in the person's nervous system. The arousal of a person's sympathetic nervous system, which causes things like increased heart rate and sweating, also causes these changes in the voice. When a person's parasympathetic system, which decreases blood pressure and slows heart rate, moves into action, it also shifts what happens in the voice itself. A glance back at the Laban movement graphs in Chapter 6 (Figures 6.12 and 6.13) shows that body movement style also seems to be modulated in this way.
So how do people learn to tell apart the high-energy or low-energy emotions in the voice? Certainly they use context contributed by the words themselves, but people are also able to tell what position the mouth is in, based upon sound. As mentioned above, a person can "hear" a smile. Voice researchers have detected different patterns of intonation as well—such as the characteristic rising pitch pattern of joy. It is also the case that people acclimate to one another's vocal patterns—knowing someone well includes knowing how they, in particular, signal sadness or joy with their voice.

One game that takes full advantage of the power of paralinguistic cues in conveying emotion is The Sims™. Sim characters speak to one another, but their words are entirely incomprehensible. Simlish may be gibberish, but it is laden with emotional signals, and it allows the player to draw conclusions about how his or her Sim is feeling in general, and in relation to other Sim characters (see Figure 7.2). For example, listen to Clip 7.5. As the Sim characters move from joy to jealousy, it is easy to follow along despite the lack of words.
Interestingly, researchers have found connections between the expression of emotion in the human voice and strategies for evoking emotion through music. Two researchers performed a meta-analysis comparing research on evoking specific emotions with music to work on emotions in speech. Data "strongly suggest that there are emotion-specific patterns of acoustic cues that can be used to communicate discrete emotions in both vocal and musical expressions of emotion" (Juslin and Laukka 2003, 799). This makes sense if one considers that the playing of musical instruments tends to reflect the muscular tension and general arousal state of the performer—creating a bridge to the listener into a particular emotional state. Some games, for example, Grim Fandango, make use of this connection between music and emotion to heighten the player's experience of a character's emotional reactions. See Figure 7.3 and Clip 7.6, in which Manny's boss berates him for a mistake. Notice the music in the background, which displays some of the same aural qualities as the boss's tirade.

FIGURE 7.2 The Sim language—"Simlish"—uses paralinguistic cues of emotion (listen to Clip 7.5). The Sims™ Unleashed image © 2005 Electronic Arts Inc. The Sims is a registered trademark of Electronic Arts Inc. in the U.S. and other countries. All rights reserved.

FIGURE 7.3 Grim Fandango uses music to heighten the player's reaction to an NPC's tirade (listen to Clip 7.6). © 1998 Lucasfilm Entertainment Company Ltd. All rights reserved.
Social Context and Identity
As researchers begin to assemble a more detailed picture of how the voice contributes to social interaction, one thing they are realizing is that emotion is not necessarily the predominant message communicated. Researchers in Japan who gathered a large body of recorded speech by asking people to wear headsets around in everyday life found few examples of strong emotion in voices. Day to day, people tended to keep their emotional reactions mostly to themselves. What did show up were big differences in patterns of voice depending upon who the person was speaking to—adjustment based on social roles and relationships (Campbell 2004)—and, of course, individual differences in vocal style that emerged from each person's own personality and physical qualities.
Some traces of social roles and relationships in voices can be broken down along the dimensions first discussed in Chapter 2: cues of dominance and of friendliness. People demonstrating dominance tend to lower their voice somewhat and to construct shorter utterances in general. They may sometimes speak more loudly, depending upon the situation. Showing submission with voice involves using a softer, more highly pitched voice, and subordinates tend to say more. As was mentioned earlier in this chapter, these general vocal contours of dominance are true of other primates as well as people. Clips 7.7 and 7.8 demonstrate the difference between dominant and submissive voices. Although the butler (Raoul) is initially very dominant, he moves to submissive obsequiousness in the second clip once Manny has a pass to the VIP lounge (see Figure 7.4). In general, Grim Fandango makes brilliant use of vocal dominance cues to heighten comic effect. Other examples of games that use dominance cues in similar ways are The Curse of Monkey Island (Figure 7.5, Clip 7.9) and Warcraft III (Figure 7.6, Clip 7.10).

FIGURE 7.4 Manny tries to gain entrance to the VIP lounge (from Grim Fandango). Listen to Clips 7.7 and 7.8 for the shift in the butler's voice once he knows Manny will be admitted. © 1998 Lucasfilm Entertainment Company Ltd. All rights reserved.
Friendliness is shown in various ways. Meeting with a friend usually leads to warmth and energy in the voice, the signals of joy. Conversation among close friends includes more range of emotion than between more distant acquaintances—more revelation of personal emotional state, and empathizing with the other's state through modulation of your own voice. Intimacy with someone is often reflected with a more breathy quality in the voice. For an example of the breathiness of intimacy, contrast the clips from Grim Fandango above with Clip 7.11—a conversation between Manny and his love interest, Meche. Both Manny and Meche have a great deal of breathiness in their voices.

FIGURE 7.5 The Curse of Monkey Island also makes use of dominance cues to heighten comic effect (listen to Clip 7.9). © 1997 Lucasfilm Entertainment Company Ltd. All rights reserved.

FIGURE 7.6 The peons in Warcraft III are charmingly submissive in their voices and responses to player commands (listen to Clip 7.10). Warcraft III: Reign of Chaos provided courtesy of Blizzard Entertainment, Inc.
Individual personality can come through in the voice as characteristic patterns of emotion and energy. For example, in Grim Fandango, the hat-check girl is a high-energy character who makes rapid turns from enthusiasm to anger (see Figure 7.7, Clip 7.12).
Social Interaction Logistics
Vocal modulations during an interaction show that a person is listening and comprehending, and also help to orchestrate turn-taking in conversation. "Back-channel" responses such as "uh hunh" make the speaker feel the listener is engaged with what is happening. People also use such noises to indicate that they are still thinking, or to express a range of emotions in response to a statement before they can put them into words.
Games with elaborate and extensive cut scenes, such as Final Fantasy X, make artful use of these sorts of cues to reveal the nuances of relationships among characters (see Figure 7.8).
Back-channel responses may be one reason that people enjoy using voice-enabled multiplayer online games as well—players can hear the triumph or despair in one another's voices as they play, heightening the experience itself, and can use vocal cues (e.g., "whoa!" or "uhhhh…") to help guide one another's actions.

FIGURE 7.7 The hat-check girl in Grim Fandango has a distinctive way of speaking (listen to Clip 7.12). © 1998 Lucasfilm Entertainment Company Ltd. All rights reserved.
7.3 Design Pointers
Missed Opportunity: Real-Time Vocal Adaptation
Currently, NPCs rarely offer real-time back-channel sounds and comments during game play. Revealing more complex awareness of a player, as well as reactions to the player through emotion- and information-laden audio cues as play situations unfold, could greatly increase the sense of social presence and connection a player feels toward an NPC. Imagine a sidekick or a just-rescued character gasping as the player executes a tricky move, or making a subtle noise of doubt and hesitation as the player starts to move in a fruitless direction.
This will become increasingly practical as voice synthesis becomes more and more realistic, eliminating the need for a huge body of prerecorded audio files (see Section 7.8 for more information about speech synthesis).
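As a thought experiment, here is a hedged sketch of how such back-channel reactions might be wired up: game events map to short vocal barks, with a cooldown and some randomness so the companion does not chatter constantly. The event names, clip identifiers, and timing values are hypothetical and not drawn from any particular engine.

    import random
    import time

    class BackchannelReactor:
        """Chooses short vocal reactions for a companion NPC in response to
        game events, throttled so they stay occasional and feel natural."""

        # Hypothetical event-to-bark mapping; clip names are placeholders.
        REACTIONS = {
            "player_tricky_move":  ["gasp_01", "whoa_02", "impressed_01"],
            "player_wrong_way":    ["hmm_doubt_01", "uh_hesitant_02"],
            "player_took_big_hit": ["wince_01", "worried_01"],
            "player_won_fight":    ["relieved_laugh_01", "cheer_02"],
        }

        def __init__(self, cooldown_seconds=6.0, chance=0.6):
            self.cooldown = cooldown_seconds
            self.chance = chance
            self._last_time = -float("inf")

        def react(self, event):
            """Return an audio clip id to play, or None to stay quiet."""
            now = time.monotonic()
            clips = self.REACTIONS.get(event)
            if not clips or now - self._last_time < self.cooldown:
                return None
            if random.random() > self.chance:
                return None   # skip some reactions so they stay surprising
            self._last_time = now
            return random.choice(clips)

In use, the game would call something like reactor.react("player_tricky_move") when its own logic detects the event, and play whatever clip id comes back on the sidekick's voice channel.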
FIGURE 7.8 The cut-scenes in Final Fantasy X use vocal cues to heighten the player's experience of the NPCs' emotions and unfolding relationships. © 2001 Square Enix Co., Ltd. Character Design: Tetsuya Nomura.
7.3.2 Give NPCs Audio Personality
If a character has strong personality traits, make sure they come through in the voice as well. It is possible to create humorous contrasts between voice and appearance (as in the case of Daxter from Jak and Daxter, discussed in Chapter 2; see Figure 7.10).
7.3.3 Use Voice (and Music) as an Emotional Regulator
Character voices can make a player calmer, more enthusiastic, or triumphant, so use voice to shape a player's emotional experience of game play: light-hearted words from a sidekick after an intense battle sequence, for example, or a gruff pep talk from a guide or mentor if things went badly. Consider using music to bolster the effects of an NPC's words, as well as to help manage player emotions as game play unfolds. (A rough code sketch of this pairing appears after the figure captions below.)

FIGURE 7.9 Final Fantasy X uses vocal cues to heighten the player's sense of characters' relationships. © 2001 Square Enix Co., Ltd. Character Design: Tetsuya Nomura.

FIGURE 7.10 Daxter (from Jak and Daxter) has a dominant voice and mannerisms and a small body (see Clip 2.7). Jak and Daxter: The Precursor Legacy is a registered trademark of Sony Computer Entertainment America Inc. Created and developed by Naughty Dog, Inc. © 2001 Sony Computer Entertainment America Inc.
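One way to act on this pointer is to let a single post-encounter mood choice select both the NPC's voice line and a music cue, so the two channels reinforce each other. A minimal sketch, with invented asset names and a deliberately simple notion of game state:

    # Hypothetical pairings of NPC voice lines and music cues, chosen to steer
    # the player's emotional state after an encounter.
    MOOD_CUES = {
        "wind_down": {"voice": "sidekick_lighthearted_03", "music": "calm_theme_a"},
        "rally":     {"voice": "mentor_gruff_peptalk_01",  "music": "resolve_theme_b"},
        "celebrate": {"voice": "sidekick_cheer_02",        "music": "victory_fanfare"},
    }

    def cues_after_encounter(player_won, was_intense):
        """Pick a voice line and music cue to regulate the player's emotion."""
        if not player_won:
            mood = "rally"          # gruff pep talk after a loss
        elif was_intense:
            mood = "wind_down"      # light-hearted words after a hard fight
        else:
            mood = "celebrate"
        return MOOD_CUES[mood]

A real decision would of course draw on much richer state, but the principle is the same: voice and music are selected together, as one emotional gesture.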
7.3.4 Voice Checklist
When specifying the audio for each character in a game, take a moment to consider each type of social cue. As audio assets are created, revisit the criteria to see if the desired qualities are coming through:

• Emotional state: How is this character feeling right now? In general? Toward the player? Toward other NPCs?
• Social status and context: What is this character's relationship to the others in the action? In general? What about right now?
• Interaction logistics with the player and other characters: How does the character acknowledge the actions and reactions of other characters?
7.5 Interview: MIT Media Lab's Zeynep Inanoglu and Ron Caneel

To respond in real time with appropriate emotion, a character needs to know how a player is feeling. Designers can fake this social awareness to some degree, because much is known about the state of the player from the game engine itself—did the player just triumph, get badly beaten, and so forth. However, there is work being done on alternative methods for assessing player emotion. Speech researchers have been working for years to detect the traces of emotions in voices, and increases in processing power and in the understanding of emotion cues in voices are beginning to yield results.

Zeynep Inanoglu and Ron Caneel, graduate students at the MIT Media Laboratory, have created a program and interface for detecting the emotional content of voice messages. The system, called Emotive Alert, looks for vocal patterns indicating valence (positive or negative feelings), activation (level of energy), formality, and urgency. The system is meant to allow the user to sort and prioritize messages and to alert the user to those that are most urgent.
Q: What was your inspiration for creating Emotive Alert?
Emotive Alert was mainly inspired by a seminar that we both attended last spring (2004). Both of us had various experiences working with speech signals, so when we came up with the idea, Professor Rosalind Picard, who was giving the seminar, encouraged us to take on this project. It also helped that Zeynep had access to her group's voicemail system and was already using the voicemail data in other projects.
Q: How did you choose the emotions to analyze from the messages?
In addition to the classical valence-arousal dimensions (Russell 1980) (happy/sad and excited/calm), we chose urgency and formality, since these are more interesting to look at in the voicemail domain. Since our approach only analyzes prosodic speech features (intonation, perceived loudness, rhythm), we hoped that these features would vary sufficiently in the dimensions that we chose.
Q: Could the method you’ve evolved for analyzing the messages be helpful for analyzing “trash talk” among players in an online game-play environment?
Our method can be retrained to detect variances from a given speaker's normal speaking style. Acoustically, one would hope that these variances imply unusual behavior (i.e., trash talk in games). However, to make such systems reliable, a key-word spotting capability should also be incorporated along with acoustic tracking.
Q: Where do you think voice analysis and synthesis are heading next? Will there
be effective real-time analysis of emotion in conversation? What about lifelike synthesis of emotion in voices?
There is a lot of room to grow in both emotional synthesis and analysis. Effective real-time analysis of emotion in conversation is a possibility, depending on what emotional categories we are tracking. The problem is not only an issue of implementation but also of theories of emotion and of available emotion data to train these systems on.
FIGURE 7.11 Emotive Alert analyzes the pitch and energy of a voicemail message, applying emotion models to suggest the predominant emotional tone of the message to the user.
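For readers curious what analyzing prosodic speech features can look like in practice, here is a hedged sketch of the kind of per-message summary a system in the spirit of Emotive Alert might compute: pitch statistics, loudness, and the proportion of pauses. It is not the actual Emotive Alert code; the feature set and the silence threshold are assumptions, and estimate_f0 is the helper sketched earlier in Section 7.2.1.

    import numpy as np

    def prosodic_features(signal, sample_rate=16000, frame_seconds=0.03):
        """Summarize a voice recording with a few prosodic statistics that
        models of activation, valence, or urgency could be trained on."""
        signal = np.asarray(signal, dtype=float)
        hop = int(frame_seconds * sample_rate)
        pitches, energies = [], []
        for start in range(0, len(signal) - hop, hop):
            frame = signal[start:start + hop]
            energies.append(float(np.sqrt(np.mean(frame ** 2))))  # RMS loudness
            f0 = estimate_f0(frame, sample_rate)  # pitch helper from Section 7.2.1
            if f0 is not None:
                pitches.append(f0)

        energies = np.array(energies)
        silence = 0.1 * energies.max() if energies.size else 0.0  # assumed threshold
        return {
            "pitch_mean":  float(np.mean(pitches)) if pitches else 0.0,
            "pitch_range": float(np.ptp(pitches)) if pitches else 0.0,
            "loudness":    float(np.mean(energies)) if energies.size else 0.0,
            "pause_ratio": float(np.mean(energies < silence)) if energies.size else 0.0,
        }

Feeding such feature vectors, labeled by listeners, into even a simple classifier is the general shape of the approach the interviewees describe; the hard parts, as they note, are the emotion categories themselves and the labeled data.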
This chapter described social qualities of voices, including emotion, social context, and identity, and the handling of social logistics in interactions. Design discussion included the power of ongoing vocal feedback, a missed opportunity in making NPCs even more lifelike and engaging, as well as ways to incorporate other social cues into character vocal design. Part IV shifts focus to particular social functions that characters have in games and how these should affect design thinking.
Each person should capture a brief segment from a movie or television show in which two characters are speaking. Take turns listening to (not watching) these brief snippets of dialogue in a group, and have everyone try to identify the relative social status and relationship of the characters, as well as their personality traits, emotional state, and as much of the social context as possible. If there are members of the group fluent in two languages, they should bring snippets from their second language. See if the group can identify status, personality, and emotions regardless of understanding the words. Discuss what it is that you are hearing and which cues are the most legible and accurate (e.g., that the group can most easily identify and agree upon).
7.8 Further Reading

Emotion and Reason
Damasio, A. R. 1994. Descartes' Error: Emotion, Reason, and the Human Brain. New York: Quill (an imprint of HarperCollins Publishers).

Russell, J. A. 1980. A circumplex model of affect. Journal of Personality and Social Psychology 39(6):1161–1178.
Voice and Emotion
Bachorowski, J. 1999. Vocal expression and perception of emotion. Current Directions in Psychological Science 8(2):53–57.

Kappas, A., U. Hess, and K. R. Scherer. 1991. Voice and emotion. In Fundamentals of Nonverbal Behavior, eds. R. S. Feldman and B. Rimé, 200–238. Cambridge: Cambridge University Press.

Massaro, D. W., and P. B. Egan. 1996. Perceiving affect from the voice and the face. Psychonomic Bulletin & Review 3(2):215–221.

ISCA. 2003. Speech Communication 40 (1 and 2), April. Special issues on emotion and speech, based upon the ISCA Speech and Emotion workshop.

van Bezooyen, R. 1984. Characteristics and Recognizability of Vocal Expressions of Emotion. Dordrecht, Holland: Foris Publications.
Music and Voice
Juslin, P. N., and P. Laukka. 2003. Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin 129(5):770–814.
Voice and Social Characteristics
Tepper, D. T., Jr., and R. F. Haase. 2001. Verbal and nonverbal communication of facilitative conditions. In Helping Skills: The Empirical Foundation, ed. C. E. Hill. Washington, DC: American Psychological Association.

Tusing, K., and J. Dillard. 2000. The sounds of dominance: Vocal precursors of perceived dominance during interpersonal influence. Human Communication Research 26:148–171.
Modeling Users from Voice
Fernandez, R., and R. W. Picard. 2003. Modeling drivers' speech under stress. Speech Communication 40:145–149.
Speech Synthesis
Burkhardt, F., and W. F. Sendlmeier. 2000. Verification of acoustical correlates of emotional speech using formant synthesis. In Proceedings of the ISCA Workshop (ITRW) on Speech and Emotion, Belfast 2000. http://www.qub.ac.uk/en/isca/proceeding

Cahn, J. E. 1990. The generation of affect in synthesized speech. Journal of the American Voice I/O Society 8 (July):1–19.

Campbell, N. 2004. Getting to the heart of the matter: Speech is more than just the expression of text or language. LREC keynote. http://feast.his.atr.jp/nick/cv.html.

Murray, I., and J. Arnott. 1993. Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. Journal of the Acoustical Society of America 93(2):1097–1108.
Linguistics
Clark, H. H. 1996. Using Language. Cambridge: Cambridge University Press.