A Vietnamese 3D Talking Face for Embodied
Conversational Agents
Thi Duyen Ngo, The Duy Bui
University of Engineering and Technology, Vietnam National University, Hanoi
Email: {duyennt,duybt}@vnu.edu.vn
Abstract—Conversational agents are receiving significant attention from the multi-agent and human-computer interaction research communities. Many techniques have been developed to enable these agents to behave in a human-like manner. In order to do so, they are simulated with communicative channels similar to those of humans; moreover, they are also simulated with emotion and personality. In this work, we focus on the issue of expressing emotions for embodied agents. We present a three-dimensional face with the ability to speak emotional Vietnamese speech and naturally express emotions while talking. Our face can represent lip movements while emotionally pronouncing Vietnamese words and, at the same time, show emotional facial expressions while speaking. The face's architecture consists of three parts: the Vietnamese Emotional Speech Synthesis module, the Emotions to Facial Expressions module, and the Combination module, which creates lip movements when pronouncing Vietnamese emotional speech and combines these movements with emotional facial expressions. We have tested the face in the football supporter domain in order to confirm its naturalness. The face is simulated as the face of a football supporter agent which experiences emotions and expresses them in his voice as well as on his face.
Keywords—Conversational Agent, Vietnamese 3D Talking Face,
Emotional Speech, Emotional Facial Expression.
I. INTRODUCTION

One particularity of humans is having emotions; this makes people different from all other animals. Emotions have been studied for a long time, and results show that they play an important role in human cognitive functions. Picard summarized this in her "Affective Computing" [1], and it has also been supported by many other scientists [2][3]. Recognizing the importance of emotions to human cognitive functions, Picard [1] concluded that if we want computers to be genuinely intelligent, to adapt to us, and to interact naturally with us, then they will need the ability to recognize and express emotions, to model emotions, and to show what has come to be called "emotional intelligence".
Conversational agents have become more and more common in the multimedia worlds of films, educative applications, e-business, etc. In order to make these agents more believable and friendly, they are simulated with emotion and personality as well as with communicative channels such as voice and facial expression. As early as the 1930s, traditional character animators incorporated emotion into animated characters to make audiences "believe in characters, whose adventures and misfortunes make people laugh - and even cry" [4]. These animators believe that emotion, appropriately timed and clearly expressed, is one of the keys to creating quality animated films. In the area of computational synthetic agents, emotions have received much attention for their influence in creating believable characters, e.g. [5][6].
According to [7], facial expressions are one of the most important sources of information about a person's emotional state. Psychologists and other researchers have long recognized the importance of facial displays for judging emotions, and they have probably received as much attention as all other expressive channels combined. The second most important expressive channel for judging emotions is speech; "much of the variation in vocal behavior can be attributed to the level of arousal" [7]. Therefore, in our work we focus on these two channels in addressing the issue of expressing emotions for embodied conversational agents. In this paper, we propose a talking face system which is a combination of our previous works.

We present a three-dimensional face with the ability to speak emotional Vietnamese speech and naturally express emotions on the face while talking. Our face can represent lip movements while emotionally pronouncing Vietnamese words and, at the same time, show emotional facial expressions while speaking. To our knowledge, no such face has been proposed before. The face is built on two systems which we have presented previously: the rule-based system for synthesizing Vietnamese emotional speech presented in [8], and the system providing a mechanism for simulating continuous emotional states of conversational agents proposed in [9]. In addition, in this work there is an extra module for creating lip movements when pronouncing Vietnamese emotional speech and combining these movements with emotional facial expressions. We have tested the face in the football supporter domain in order to confirm its naturalness. The face is simulated as the face of a football supporter agent which experiences emotions and expresses them in his voice as well as on his face. The rest of the paper is organized as follows. First, we present the face's architecture in Section II; the construction and operation of the three main modules of the face are described in three subsections. We then test the face in the football supporter domain in Section III. Finally, the conclusion is presented in Section IV.
II. SYSTEM ARCHITECTURE

The talking face is built on the two systems that we have proposed before [8][9]. An overview of the face's architecture can be seen in Figure 1.

Figure 1: The face's architecture.

The face takes as input neutral speech with a corresponding phoneme list with temporal information and a series of Emotion State Vectors (ESV) over time. Ideally, one part of the input should
be text instead of neutral speech with a phoneme list. However, our work does not focus on text-to-speech; we only concentrate on synthesizing Vietnamese emotional speech from neutral speech. Therefore, we assume that there is a Vietnamese text-to-speech system which returns neutral speech with a corresponding phoneme list from text, and we use this as one part of the input for our system. There are three main modules in the talking face system: the Vietnamese Emotional Speech Synthesis (VESS) module, the Emotions to Facial Expressions (EFE) module, and the Combination module. The VESS module uses the system in [8] to convert Vietnamese neutral speech to Vietnamese emotional speech according to the corresponding emotional style. The EFE module uses the system in [9] to simulate continuous emotional facial expressions from the series of ESV. The Combination module creates lip movements when pronouncing Vietnamese emotional speech from the list of phonemes (with temporal information) and combines these movements with emotional facial expressions. Finally, the facial expressions and movements are displayed with synchronized emotional speech on a 3D face. In our system, we use the muscle-based 3D face model presented in [10]. This face model is able to produce both realistic facial expressions and real-time animation on standard personal computers. The construction and operation of the system's components are described in the next subsections.
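The overall data flow described above can be summarized in a short sketch. The module classes, method names, and data types below are hypothetical stand-ins used only to illustrate how the three modules of Figure 1 might be wired together; they are not taken from the actual implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical container types; field names are illustrative only.
@dataclass
class Phoneme:
    label: str       # e.g. "a", "b"
    start_ms: float  # temporal information from the (assumed) TTS front-end
    end_ms: float

@dataclass
class EmotionStateVector:      # ESV: one intensity per basic emotion
    intensities: List[float]

def talking_face_pipeline(neutral_speech, phonemes, esv_series,
                          vess, efe, combiner, face3d):
    """One hypothetical pass through the architecture of Figure 1."""
    # 1. VESS: neutral speech -> emotional speech in the target style.
    emotional_speech = vess.convert(neutral_speech, phonemes)

    # 2. EFE: series of ESV -> series of Facial Muscle Contraction Vectors.
    fmcv_series = efe.to_expressions(esv_series)

    # 3. Combination: lip movements from the phoneme list, merged with
    #    the emotional facial expressions.
    animation = combiner.combine(phonemes, fmcv_series)

    # 4. Display the animation on the 3D face, synchronized with the audio.
    face3d.play(animation, emotional_speech)
```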
A. Vietnamese Emotional Speech Synthesis (VESS) module
The VESS module uses our previously proposed work [8] to convert Vietnamese neutral speech to Vietnamese emotional speech according to the corresponding emotional style. The module takes Vietnamese neutral speech with a corresponding phoneme list as input and returns Vietnamese emotional speech as output. In [8], we presented a framework used to simulate four basic emotional styles of Vietnamese speech by means of acoustic feature conversion techniques applied to neutral utterances. Results of perceptual tests showed that the emotional styles were well recognized. The work in [8] is described in more detail in the following.
First, we carried out analyses of acoustic features of Vietnamese emotional speech in order to find the relations between acoustic feature variations and emotional states in Vietnamese speech. The analyses were performed on a speech database which consisted of Vietnamese utterances of 19 sentences, produced by one male and one female professional Vietnamese artist. The two actors were asked to produce utterances using five different styles: neutral, happiness, cold anger, sadness, and hot anger; each sentence had one utterance in each of the five styles, for each speaker. From the database, we then extracted acoustic cues related to emotional speech: the F0 contour, the power envelope, and the spectrum were calculated using STRAIGHT [11], while time duration was manually specified with the partial support of WaveSurfer [12].

At the utterance level, a total of 14 acoustic parameters were calculated and analysed; average pitch and average power at the syllable level were also examined. For each utterance, from the extracted F0 information, the highest pitch (HP), average pitch (AP), and pitch range (PR) were measured, and the average pitch of syllables was also examined. From the extracted power envelope, the considered acoustic parameters were maximum power (HPW), average power (APW), power range (PWR), and the average power of syllables. Next, for duration, the time segmentation of each utterance was manually measured first; the measurement included phoneme number, time (ms), and vowel. Then the durations of all phonemes, both consonants and vowels, as well as pauses were specified. From there, the mean of pause lengths (MPAU), total length (TL), consonant length (CL), and the ratio between consonant length and vowel length (RCV) were measured. Finally, from the extracted spectrum, the formants (F1, F2, F3) and spectral tilt (ST) were examined. Formant measures were taken approximately at the vowel midpoint using LPC order 12; spectral tilt was calculated from H1-A3, where H1 is the amplitude of the first harmonic and A3 is the amplitude of the strongest harmonic in the third formant.
After performing the above extraction phase, for each of the 190 utterances we had a set of 14 values corresponding to the 14 acoustic parameters at the utterance level. From these 190 sets, the values of variation coefficients with respect to the baseline (neutral style) were calculated. As a result, we had 152 sets of 14 variation coefficients: 19 sets for each of the four emotional styles, for each of the two speakers. After that, clustering was carried out on each pack of 19 sets, and the cluster containing the largest number of sets was chosen. Finally, from the chosen cluster, the mean values of the variation coefficients corresponding to the 14 parameters of each emotional style were calculated. At the syllable level, mean values of variation coefficients corresponding to the mean F0, average power, and mean duration of the syllables belonging to the word/compound word at the beginning and at the end of the sentence were also calculated. We were interested in these syllables because, when the emotional state changes, acoustic features vary more in some syllables of a phrase rather than changing uniformly over the whole phrase. When analyzing the database, we found that syllables belonging to the word/compound word at the beginning and at the end of the sentence varied more than other syllables. The mean values of variation coefficients at the utterance level as well as at the syllable level were used to form rules for converting neutral speech to emotional speech.
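As an illustration of this analysis step, the sketch below computes variation coefficients of the 14 utterance-level parameters relative to the neutral baseline and clusters the 19 coefficient sets of one emotional style, keeping the mean of the largest cluster. The use of k-means, the ratio-based coefficient, and the array layout are our own illustrative assumptions, not necessarily the exact procedure of [8].

```python
import numpy as np
from sklearn.cluster import KMeans

def style_variation_coefficients(emotional, neutral, n_clusters=3):
    """emotional, neutral: arrays of shape (19, 14) holding the 14
    utterance-level parameters for one speaker and one emotional style
    (and the neutral baseline). Returns the mean variation coefficients
    of the most populated cluster."""
    # Variation coefficient of each parameter w.r.t. the neutral baseline
    # (assumed here to be a simple ratio).
    coeffs = emotional / neutral                       # shape (19, 14)

    # Cluster the 19 coefficient sets and keep the largest cluster.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(coeffs)
    largest = np.bincount(km.labels_).argmax()
    chosen = coeffs[km.labels_ == largest]

    # The mean coefficients of the chosen cluster feed the conversion rules.
    return chosen.mean(axis=0)                         # shape (14,)
```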
In our work, we used a speech morphing technique to produce Vietnamese emotional speech. Our speech morphing process is presented in Figure 2. Firstly, STRAIGHT [11] was used to extract the F0 contour, power envelope, and spectrum of the neutral speech signal, while segmentation information was measured manually.

Figure 2: Speech morphing process using STRAIGHT.

Figure 3: Acoustic feature modification process.

Then, the acoustic features in terms of F0 contour, power envelope, spectrum, and duration were modified based on morphing rules inferred from the variation coefficients obtained in the analysis stage. These modifications were carried out taking into account the variations of acoustic features at the syllable level: syllables belonging to the words/compound words at the beginning and the end of the utterance are modified more. Finally, emotional speech is synthesized from the modified F0 contour, power envelope, spectrum, and duration using STRAIGHT. The modifications are carried out according to the flow presented in Figure 3.
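To make the morphing step concrete, the following sketch applies utterance-level variation coefficients to the F0 contour and power envelope extracted by STRAIGHT, with a stronger modification on the frames belonging to the first and last word of the utterance. The scaling scheme and the extra emphasis factor are illustrative assumptions; the actual rules in [8] are derived from the analysis described above.

```python
import numpy as np

def apply_morphing_rules(f0, power, coeff_f0, coeff_power,
                         boundary_mask, boundary_gain=1.2):
    """f0, power: per-frame contours extracted (e.g. by STRAIGHT) from the
    neutral speech. coeff_f0, coeff_power: variation coefficients of the
    target emotional style. boundary_mask: boolean array marking the frames
    of the first/last word of the utterance; boundary_gain is a hypothetical
    factor expressing that these syllables are modified more."""
    f0_mod = f0 * coeff_f0
    power_mod = power * coeff_power

    # Emphasize the modification on the boundary syllables.
    f0_mod[boundary_mask] = f0[boundary_mask] * coeff_f0 * boundary_gain
    power_mod[boundary_mask] = power[boundary_mask] * coeff_power * boundary_gain

    # The modified contours (together with the modified spectrum and
    # durations) would then be passed back to STRAIGHT for re-synthesis.
    return f0_mod, power_mod
```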
B. Emotions to Facial Expressions (EFE) module
The EFE module uses our proposed system in [9] to simulate continuous emotional facial expressions. The module takes a series of Emotion State Vectors (ESV) over time as input and returns a corresponding series of Facial Muscle Contraction Vectors (FMCV) as output. In [9], a scheme providing a mechanism to simulate continuous emotional states of a conversational agent was proposed based on the temporal patterns of the facial activities of six basic emotions. These temporal patterns were the results of an analysis of a spontaneous video database consisting of video sequences selected from three databases, namely MMI [13], FEEDTUM [14], and DISFA [15]. We used facial expression recognition techniques to analyze the database and then extracted the general temporal patterns for facial expressions of the six basic emotions. The analysis process was performed in four steps. First, for each frame of the input video, the Face Detector module used the Viola-Jones algorithm [16] to detect the face and return its location. Then the ASM Fitting module extracted feature points from the detected face using the ASM fitting algorithm [17]; an ASM shape of the face containing the locations of 68 feature points was returned. From this shape, the Face Normalization module carried out the normalizing task in order to set the shape to a common size (the distance between the centers of the eyes was used as the standard distance). Finally, the AUs Intensity Extractor module used the normalized feature point locations to calculate the intensity of the Action Units (AUs) related to the emotion style of the input video. (Action Units were defined by Ekman and Friesen, who developed the Facial Action Coding System (FACS) [18] to identify all possible visually distinguishable facial movements. FACS identifies the various facial muscles that, individually or in groups, cause changes in facial behaviors; these changes in the face and the underlying muscles that cause them are called Action Units (AUs). For each emotion, there is a set of related AUs that distinguishes it from the others.)
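The four analysis steps can be sketched as the per-frame loop below. The Viola-Jones detector is available in OpenCV; the ASM fitting and the AU intensity computation appear only as hypothetical helper callables (asm_fit, au_intensities), since the concrete models used in [9] are not part of a standard library, and the eye landmark indices follow the common 68-point convention as an assumption.

```python
import cv2
import numpy as np

def analyze_video(frames, asm_fit, au_intensities):
    """frames: iterable of BGR images. asm_fit is assumed to return a
    (68, 2) array of feature point coordinates for the detected face;
    au_intensities maps normalized points to per-frame AU intensities."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    series = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue                                  # skip frames with no face
        shape = asm_fit(gray, faces[0])               # ASM fitting on the detection

        # Normalize the shape so that the inter-ocular distance becomes 1.
        left_eye = shape[36:42].mean(axis=0)
        right_eye = shape[42:48].mean(axis=0)
        points = shape / np.linalg.norm(right_eye - left_eye)

        series.append(au_intensities(points))         # per-frame AU intensities
    return np.array(series)
```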
Figure 4: (a) Temporal pattern for facial expressions of happiness and sadness. (b) Temporal pattern for facial expressions of fear, anger, disgust, and surprise.

For a video of each emotion, we had a temporal series of intensity values for each AU. This series was then extracted and graphed. By observing these graphs and the videos, we formed the hypothesis that facial expressions happen in a series with decreasing intensity when a corresponding emotion is triggered. Thence, we proposed pre-defined temporal patterns for the facial expressions of the six basic emotions (Figure 4). In these patterns, the solid line part is always present while the dashed line part may be absent. As shown in the patterns, although the internal emotional state may keep a constant, sufficient intensity for a long time, the corresponding facial expressions are not always at the same intensity during that time. Rather, the facial expressions appear with an intensity corresponding to the intensity of the emotion, stay in this state for a while, and then fall back near the initial state. We call this process a cycle. We define a cycle of facial expressions as:

E = (P, Ts, Te, Do, Dr)

where P defines the target intensity of the expressions; Ts and Te are the starting time and the ending time of the cycle; and Do and Dr are the onset duration and offset duration of the expressions, respectively. The process in which the expressions occur in a cycle is described as a function of time,
where φ+ and φ− are the functions that describe the onset and offset phases of the expressions:

φ+(x, Do) = exp((ln 2 / Do) · x) − 1

φ−(x, Dr) = exp(ln 2 − ((ln 2 − ln(Pa + 1)) / Dr) · x) − 1
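A direct implementation of one expression cycle built from these onset and offset functions might look as follows. The way the two phases and the plateau are stitched together between Ts and Te is our reading of the cycle definition above, so the sketch should be treated as an illustration rather than the exact formulation of [9].

```python
import math

def phi_onset(x, d_onset):
    """Onset function: rises from 0 at x = 0 to 1 at x = d_onset."""
    return math.exp((math.log(2) / d_onset) * x) - 1

def phi_offset(x, d_offset, p_a):
    """Offset function: falls from 1 at x = 0 towards p_a at x = d_offset.
    p_a is read here as the residual (near-initial) intensity level."""
    rate = (math.log(2) - math.log(p_a + 1)) / d_offset
    return math.exp(math.log(2) - rate * x) - 1

def cycle_intensity(t, P, t_s, t_e, d_onset, d_offset, p_a=0.0):
    """Hypothetical intensity over time of one cycle E = (P, Ts, Te, Do, Dr)."""
    if t < t_s or t > t_e:
        return 0.0
    if t <= t_s + d_onset:                       # onset phase
        return P * phi_onset(t - t_s, d_onset)
    if t >= t_e - d_offset:                      # offset phase
        return P * phi_offset(t - (t_e - d_offset), d_offset, p_a)
    return P                                     # sustained (apex) phase
```

By construction, phi_onset reaches exactly 1 at x = Do and phi_offset starts at 1 and ends at Pa, matching the two formulas above.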
In order to verify the reasonableness of the pre-defined temporal patterns, we performed a fitting task on the temporal AU profiles. If the distance between the centers of the two eyes was normalized to 1, the sum of squares due to error (SSE) of the fit was 0.0207. Performing the fitting task for all temporal AU profiles, we found that the average SSE was 0.055, with a standard deviation of 0.078. These values showed that the above temporal patterns and the fitting function were reasonable.
Based on the temporal patterns, we proposed the scheme illustrated in Figure 5 to improve the conversion of the continuous emotional states of an agent into facial expressions. The idea is that facial expressions happen in a series with decreasing intensity when a corresponding emotion is triggered. For example, when an event triggers the happiness of a person, he/she will not smile at full intensity for as long as the happiness lasts; instead, he/she will express a series of smiles of decreasing intensity. Thus, emotional facial expressions appear only when there is a significant stimulus that changes the emotional state; otherwise, the expressions on the face are kept at a low level, displaying moods rather than emotions, even when the intensities of the emotions are high. Emotional expressions do not stay on the face for a long time, although emotions decay slowly; the expressions of moods, however, can last much longer on the face.

Figure 5: The scheme to convert continuous emotional states of an agent to facial expressions.
The Expression Mode Selection adjusts the series of ESV over time so that the corresponding facial expressions happen temporally in a way similar to the temporal patterns. This module determines whether an emotional facial expression should be generated to express the current emotional state, or whether the expressions on the 3D face should be kept at a low level, displaying moods rather than emotions. It first checks whether there is a significant increase in the intensity of any emotion i during the last Ti seconds (the duration of an emotional expression cycle), that is, whether

e_i^x − e_i^(x−1) > θ, with t − Ti ≤ x ≤ t,

where t is the current time and θ is the threshold to activate emotional facial expressions. (According to the analytic results on the video database, Ti has a value of about 3.5 for happiness, 5.3 for sadness, 3.6 for disgust, 3 for anger and fear, and 2.7 for surprise.) If there is a significant change, the ESV is converted directly to FMCV using the fuzzy rule based system proposed in [19], and cycle-tag_i is set to 1 for the happy and sad emotions and to 3 for the fear, anger, surprise, and disgust emotions. If not, the ESV is normalized as follows, where t'_i is the time at which the most recent cycle ends and t is the current time:

• if cycle-tag_i = 1 and t'_i + 3 ≤ t ≤ t'_i + 3 + Ti * 0.8, then e_i^t = e_i^t * 0.8 and cycle-tag_i = 2;

• if cycle-tag_i = 2 and t'_i + 3 ≤ t ≤ t'_i + 3 + Ti * 0.6, then e_i^t = e_i^t * 0.6 and cycle-tag_i = 3;

• otherwise, e_i^t is normalized to a lower intensity.

In this way, the emotions are displayed as moods, the low-intensity and long-lasting state of emotions. After being normalized, the ESV is converted to FMCV using the same fuzzy rule based system [19].
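The decision logic just described can be summarized in the following sketch. The data layout (a per-emotion history of intensities indexed by discrete time steps), the final "mood" scaling factor, and the helper standing in for the fuzzy rule based system of [19] are assumptions made for illustration only.

```python
# Per-emotion cycle durations Ti (seconds), from the analysis in [9].
T_CYCLE = {"happiness": 3.5, "sadness": 5.3, "disgust": 3.6,
           "anger": 3.0, "fear": 3.0, "surprise": 2.7}

def expression_mode_selection(history, emotion, t, theta,
                              cycle_tag, t_cycle_end, fuzzy_esv_to_fmcv):
    """history: dict mapping a discrete time step x -> intensity e_i^x of one
    emotion i. fuzzy_esv_to_fmcv is a stand-in for the fuzzy rule based
    system of [19]. Returns (fmcv, updated cycle_tag)."""
    Ti = T_CYCLE[emotion]

    # Significant increase of this emotion within the last Ti seconds?
    triggered = any(history[x] - history[x - 1] > theta
                    for x in history
                    if t - Ti <= x <= t and (x - 1) in history)

    if triggered:
        cycle_tag = 1 if emotion in ("happiness", "sadness") else 3
        return fuzzy_esv_to_fmcv(history[t]), cycle_tag

    # Otherwise normalize the intensity so the face shows a mood.
    e_t = history[t]
    if cycle_tag == 1 and t_cycle_end + 3 <= t <= t_cycle_end + 3 + Ti * 0.8:
        e_t, cycle_tag = e_t * 0.8, 2
    elif cycle_tag == 2 and t_cycle_end + 3 <= t <= t_cycle_end + 3 + Ti * 0.6:
        e_t, cycle_tag = e_t * 0.6, 3
    else:
        e_t *= 0.3        # assumed low-intensity "mood" level, illustrative only
    return fuzzy_esv_to_fmcv(e_t), cycle_tag
```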
C. Combination module
The Combination module creates lip movements when pronouncing Vietnamese emotional speech from the list of phonemes with temporal information and combines these movements with the emotional facial expressions from the EFE module.
Visemes for Vietnamese phonemes.
In order to create lip movements during Vietnamese speech, we first need a set of visemes for the face corresponding to the Vietnamese phonemes. Similar to our previous work [20], we follow the rules in [21] and [22] to specify the visemes corresponding to individual Vietnamese phonemes. According to [21], Vietnamese phonemes are divided into two types: vowels and consonants. Regarding the visemes of the vowels, these phonemes are separated and expressed according to three main factors: the position of the tongue, the open degree of the mouth, and the shape of the lips. By the openness of the mouth, the vowels are divided into four categories: close vowels (i), semi-close vowels (ê), semi-open vowels (e), and open vowels (a). The narrow-wide property of the vowels is specified by the gradual widening of the mouth. By the shape of the lips, the vowels are separated into two types: round-lip vowels (o, ô) and unround-lip vowels (ơ). The round or unround property is determined by the shape of the lips. Figure 6 shows the relationships between the vowels and these two properties: the horizontal lines express the open degree of the mouth, and the vertical lines express the shape of the lips, with the left part showing the unround-lip vowels and the right part the round-lip vowels. Regarding the visemes of the consonants, these phonemes are separated and expressed according to two main factors: where and how the phonemes are pronounced. According to the first factor, consonants are divided into three types: lip consonants (b, p, v, ph), tongue consonants (đ, ch, c, k), and fauces consonants (h).

Figure 6: Vowel trapezium.

Because the 3D face model [10] which we use simulates the effect of vector muscles, a sphincter muscle, and jaw rotation, it can display facial movements during Vietnamese speech. The open degree of the mouth corresponds to the amount of jaw rotation, and the round degree of the lips depends on the muscles which act on the lips. For simplicity, some vowels which are fairly similar are merged into one group. In order to create the vowel visemes, the amount of jaw rotation and the contraction degree of the muscles acting on the lips are first determined based on the vowel trapezium. After that, these values are manually refined by comparing the vowel visemes of the 3D face with the vowel visemes of a real human face. To create visemes for consonants, we consider only the positions where the phonemes are pronounced. According to this factor, we divided the consonants into three types: lip-lip consonants, lip-tooth consonants, and a last type containing the remaining consonants. We follow the rules in [21] and [22] to create the initial visemes for the consonants, and afterwards we also manually refine these as we do with the vowel visemes.
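In code, the vowel visemes can be stored as a small table of jaw rotation and lip rounding parameters derived from the vowel trapezium. The numeric values below are purely illustrative placeholders, since the actual values were tuned manually against a real human face.

```python
# Illustrative viseme table for Vietnamese vowel groups.
# jaw: open degree of the mouth (0 = closed, 1 = fully open)
# lip_round: contraction of the lip (sphincter) muscle (0 = spread, 1 = round)
# All numeric values are placeholders, not the manually refined ones.
VOWEL_VISEMES = {
    "i": {"jaw": 0.10, "lip_round": 0.0},   # close vowel
    "ê": {"jaw": 0.25, "lip_round": 0.0},   # semi-close vowel
    "e": {"jaw": 0.45, "lip_round": 0.0},   # semi-open vowel
    "a": {"jaw": 0.70, "lip_round": 0.0},   # open vowel
    "ơ": {"jaw": 0.40, "lip_round": 0.1},   # unround-lip vowel
    "ô": {"jaw": 0.35, "lip_round": 0.8},   # round-lip vowel
    "o": {"jaw": 0.50, "lip_round": 0.7},   # round-lip vowel
}

# Consonant visemes grouped only by the place of articulation.
CONSONANT_GROUPS = {
    "lip-lip":   ["b", "p"],
    "lip-tooth": ["v", "ph"],
    "other":     ["đ", "ch", "c", "k", "h"],
}
```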
Combination of lip movements when talking.
Human speech always comes as a paragraph, a sentence, or a few words. These consist of a set of phonemes, and groups of phonemes form words. For each single phoneme we already have a specific viseme. The task now is to make the movement from one viseme (e.g. V1) to another viseme (e.g. V2) gradual and smooth, in order to make the lip movement during speech realistic. The simplest way is to create intermediate visemes between V1 and V2 by adding V1's and V2's corresponding parameter values and taking the average. However, this is not a really good choice, because the articulation of a speech segment is not self-contained; it depends on the preceding and upcoming segments. In our approach, we apply the dominance model [23] to create the coarticulation effect on lip movements when talking. Coarticulation is the blending effect that surrounding phonemes have on the current phoneme. In [23], a lip movement corresponding to a speech segment is represented as a viseme segment. Each viseme segment has a dominance over the vocal articulators which increases and decreases over time during articulation. This dominance function specifies how close the lips come to reaching the target values of the viseme. A blending over time of the articulations is created by overlapping the dominance functions of adjacent movements corresponding to articulatory commands. Each movement has a set of dominance functions, one for each parameter. Different dominance functions can overlap for a given movement. The weighted average of all the co-occurring dominance functions produces the final lip shape.
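A minimal rendering of this blending idea is given below. It follows the commonly cited form of the Cohen-Massaro dominance function (a negative exponential in the time distance from the segment center); the parameter values and the exact function shape used in our face are not specified here, so this should be read as a generic sketch of [23] rather than our tuned implementation.

```python
import numpy as np

def dominance(t, center, alpha, theta, c=1.0):
    """Dominance of one viseme segment at time t: a negative exponential
    in the distance from the segment center (Cohen-Massaro style)."""
    return alpha * np.exp(-theta * np.abs(t - center) ** c)

def blended_lip_parameter(t, segments):
    """segments: list of dicts with keys 'center', 'target', 'alpha', 'theta'
    for one lip parameter. Returns the coarticulated value at time t as the
    dominance-weighted average of the per-viseme targets."""
    doms = np.array([dominance(t, s["center"], s["alpha"], s["theta"])
                     for s in segments])
    targets = np.array([s["target"] for s in segments])
    return float(np.sum(doms * targets) / np.sum(doms))
```

For example, a bilabial consonant whose closed-lip target has a high dominance will still pull the lips toward closure while a neighboring vowel is being articulated, which is exactly the coarticulation effect described above.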
Combination of emotional expressions and facial movements during speech.

In order to combine the emotional facial expressions (the output of the EFE module) with the above lip movements during Vietnamese speech, we apply the research proposed in [24]. There, facial movements are divided into groups called channels according to their type, such as emotion displays, lip movements when talking, etc. Schemes are then presented for the combination of movements within one channel and for the combination across different channels. These schemes resolve possibly conflicting muscles in order to eliminate unnatural facial animations. At a certain time, when there is a conflict between parameters in different animation channels, the parameters involved in the movement with the higher priority dominate the ones with the lower priority. In our talking face, we give higher priority to the lip movements when talking. The final facial animations resulting from the combination are displayed on the 3D talking face with synchronized synthesized emotional speech.
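The conflict resolution between channels can be expressed as a simple priority rule over the muscle parameters, as sketched below; the channel structure and parameter naming are assumptions for illustration, not the exact scheme of [24].

```python
def combine_channels(channels):
    """channels: list of (priority, params) pairs, where params maps a
    muscle/jaw parameter name to its contraction value for that channel
    (e.g. emotion display vs. lip movement). When two channels drive the
    same parameter, the higher-priority channel wins."""
    combined = {}
    best_priority = {}
    for priority, params in channels:
        for name, value in params.items():
            if name not in combined or priority > best_priority[name]:
                combined[name] = value           # higher-priority channel dominates
                best_priority[name] = priority
    return combined

# In our talking face the lip-movement channel would receive the higher
# priority, e.g.: combine_channels([(1, emotion_params), (2, lip_params)])
```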
III. EVALUATION
In order to test our talking face, we use ParleE, an emotion model for a conversational agent [25], and put the face in the football supporter domain [26]. ParleE is a quantitative, flexible, and adaptive model of emotions in which appraising events is based on learning and a probabilistic planning algorithm. ParleE also models personality and motivational states and their roles in determining the way the agent experiences emotion. This model was developed in order to enable conversational agents to respond to events with the appropriate expressions of emotions at different intensities. We put the face in the domain of a football supporter [26] because football is an emotional game; there are many events in the game that trigger emotions not only of the players but also of the coaches, supporters, etc. Testing the face in the football supporter domain gives us the chance to test many emotions as well as the dynamics of emotions, because the actions in a football match happen fast. Our talking face plays the role of the face of a football (soccer) supporter agent. The agent is watching a football match in which the team he supports is playing. The agent can experience different emotions by appraising events based on his goals, standards, and preferences. The emotions are then shown on the face and in the voice of our talking face. In short, the purpose of using ParleE and the football supporter domain is to provide good input to test our talking face.
Figure 7: A picture of the talking face.

Figure 7 shows a picture of our talking face. We have performed an experiment to collect an evaluation of its ability to express continuous emotional states. Following Katherine Isbister and Patrick Doyle [27], we selected the user test method for evaluating experiments related to emotions and facial expressions. To obtain the users' assessment, we showed them a clip of the talking face and then asked them to answer some questions. The face was tested with 20 users (10 males and 10 females) aged between 15 and 35, with an average age of 27 years. Each user test session took about 17 minutes. Sessions began with a brief introduction to the experiment process and the talking face. During the next 7 minutes, the user watched a short clip of the face. Finally, each user was interviewed separately about his/her assessment of the face. We asked a total of four questions, as shown in Figure 8.

Figure 8: Summary of interview results from the user test.

According to the users' assessments, the talking face was able to express emotions on the face and in the voice quite naturally.
IV. CONCLUSION
We have presented a three-dimensional face with the ability to speak emotional Vietnamese speech and naturally express emotions while talking. Our face can represent lip movements while emotionally pronouncing Vietnamese words and, at the same time, show emotional facial expressions while speaking. We have tested the face in the football supporter domain in order to confirm its naturalness. The face was simulated as the face of a football supporter which experiences emotions and expresses them in his voice as well as on his face. The experimental results show that our talking face is able to express emotions on the face and in the voice quite naturally.
REFERENCES

[1] R. Picard, Affective Computing. MIT Press, Cambridge, MA, 1997.
[2] D. H. Gelernter, The Muse in the Machine. Free Press, New York, 1994.
[3] A. R. Damasio, Descartes' Error: Emotion, Reason, and the Human Brain. G. P. Putnam, New York, 1994.
[4] F. Thomas and O. Johnston, The Illusion of Life. Abbeville Press, New York, 1981.
[5] C. Pelachaud, "Modelling multimodal expression of emotion in a virtual agent," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1535, pp. 3539–3548, 2009.
[6] M. C. Prasetyahadi, I. R. Ali, A. H. Basori, and N. Saari, "Eye, lip and crying expression for virtual human," International Journal of Interactive Digital Media, vol. 1(2), 2013.
[7] G. Collier, Emotional Expression. Lawrence Erlbaum Associates, New Jersey, 1985.
[8] T. D. Ngo, M. Akagi, and T. D. Bui, "Toward a rule-based synthesis of Vietnamese emotional speech," in Proc. of the Sixth International Conference on Knowledge and Systems Engineering (KSE 2014), pp. 129–142.
[9] T. D. Ngo, T. H. N. Vu, V. H. Nguyen, and T. D. Bui, "Improving simulation of continuous emotional facial expressions by analyzing videos of human facial activities," in Proc. of the 17th International Conference on Principles and Practice of Multi-Agent Systems (PRIMA 2014), pp. 222–237.
[10] T. D. Bui, D. Heylen, and A. Nijholt, "Improvements on a simple muscle-based 3D face for realistic facial expressions," in Proc. CASA 2003, 2003, pp. 33–40.
[11] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne, "Restructuring speech representations using a pitch adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds," Speech Communication, vol. 27, pp. 187–207, 1999.
[12] "WaveSurfer: http://www.speech.kth.se/wavesurfer/index.html."
[13] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, "Web-based database for facial expression analysis," in Proc. 13th ACM Int'l Conf. Multimedia and Expo, pp. 317–321, 2005.
[14] F. Wallhoff, "The facial expressions and emotions database homepage (FEEDTUM)," www.mmk.ei.tum.de/ waf/fgnet/feedtum.html.
[15] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn, "DISFA: A spontaneous facial action intensity database," IEEE Transactions on Affective Computing, vol. 4, no. 2, pp. 151–160, 2013.
[16] P. Viola and M. Jones, "Robust real-time object detection," Tech. Rep., Cambridge Research Laboratory Technical Report Series, no. 2, 2001.
[17] T. Cootes, C. Taylor, D. Cooper, and J. Graham, "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38–59, 1995.
[18] P. Ekman and W. V. Friesen, Facial Action Coding System. Palo Alto, CA: Consulting Psychologists Press, 1978.
[19] T. D. Bui, D. Heylen, M. Poel, and A. Nijholt, "Generation of facial expressions from emotion using a fuzzy rule based system," in Australian Joint Conf. on Artificial Intelligence (AI 2001), Lecture Notes in Computer Science, Springer, Berlin, 2001, pp. 83–95.
[20] T. D. Ngo, N. L. Tran, Q. K. Le, C. H. Pham, and L. H. Bui, "An approach for building a Vietnamese talking face," Journal on Information and Communication Technologies, no. 6(26), 2011.
[21] X. T. Đỗ and H. T. Lê, Giáo trình tiếng Việt 2. Nhà xuất bản đại học Sư Phạm, 2007.
[22] T. L. Nguyễn and T. H. Nguyễn, Tiếng Việt (Ngữ âm và Phong cách học). Nhà xuất bản đại học Sư Phạm, 2007.
[23] M. M. Cohen and D. W. Massaro, "Modeling coarticulation in synthetic visual speech," in Models and Techniques in Computer Animation, pp. 139–156.
[24] T. D. Bui, D. Heylen, and A. Nijholt, "Combination of facial movements on a 3D talking head," in Proc. of the Computer Graphics International, 2004, pp. 284–290.
[25] ——, "ParleE: An adaptive plan-based event appraisal model of emotions," in Proc. KI 2002: Advances in Artificial Intelligence, pp. 129–143.
[26] ——, "Building embodied agents that experience and express emotions: A football supporter as an example," in Proceedings of the 17th Annual Conference on Computer Animation and Social Agents (CASA 2004), 2004.
[27] K. Isbister and P. Doyle, "Design and evaluation of embodied conversational agents: a proposed taxonomy," in Proceedings of the AAMAS 2002 Workshop on Embodied Conversational Agents: Let's Specify and Evaluate Them!, Bologna, Italy, 2002.