A Vietnamese 3D Talking Face for Embodied
Conversational Agents
Thi Duyen Ngo, The Duy Bui
University of Engineering and Technology, Vietnam National University, Hanoi
Email: {duyennt,duybt}@vnu.edu.vn
Abstract—Conversational agents are receiving significant attention from the multi-agent and human-computer interaction research communities. Many techniques have been developed to enable these agents to behave in a human-like manner. In order to do so, they are simulated with communicative channels similar to those of humans; moreover, they are also simulated with emotion and personality. In this work, we focus on the issue of expressing emotions for embodied agents. We present a three-dimensional face with the ability to speak emotional Vietnamese speech and naturally express emotions while talking. Our face can represent lip movements while emotionally pronouncing Vietnamese words and, at the same time, show emotional facial expressions while speaking. The face's architecture consists of three parts: the Vietnamese Emotional Speech Synthesis module, the Emotions to Facial Expressions module, and the Combination module, which creates lip movements when pronouncing Vietnamese emotional speech and combines these movements with emotional facial expressions. We have tested the face in the football supporter domain in order to confirm its naturalness. The face is simulated as the face of a football supporter agent which experiences emotions and expresses them in his voice as well as on his face.
Keywords—Conversational Agent, Vietnamese 3D Talking Face,
Emotional Speech, Emotional Facial Expression.
I. INTRODUCTION

One particularity of humans is having emotions; this makes people different from all other animals. Emotions have been studied for a long time, and results show that they play an important role in human cognitive functions. Picard summarized this in her "Affective Computing" [1], and it has also been supported by many other scientists [2][3]. Recognizing the importance of emotions to human cognitive functions, Picard [1] concluded that if we want computers to be genuinely intelligent, to adapt to us, and to interact naturally with us, then they will need the ability to recognize and express emotions, to model emotions, and to show what has come to be called "emotional intelligence".
Conversational agents have become more and more common in the multimedia worlds of films, educative applications, e-business, etc. In order to make these agents more believable and friendly, they are simulated with emotion and personality as well as with communicative channels such as voice and facial expression. As early as the 1930s, traditional character animators incorporated emotion into animated characters to make audiences "believe in characters, whose adventures and misfortunes make people laugh - and even cry" [4]. These animators believe that emotion, appropriately timed and clearly expressed, is one of the keys to creating quality animated films. In the area of computational synthetic agents, emotions have received much attention for their influence in creating believable characters, e.g. [5][6].
According to [7], facial expressions are one of the most important sources of information about a person's emotional state. Psychologists and other researchers have long recognized the importance of facial displays for judging emotions, and they have probably received as much attention as all other expressive channels combined. The second most important expressive channel for judging emotions is speech; "much of the variation in vocal behavior can be attributed to the level of arousal" [7]. Therefore, in our work we focus on these two channels in addressing the issue of expressing emotions for embodied conversational agents. In this paper, we propose a talking face system which is a combination of our previous works.

We present a three-dimensional face with the ability to speak emotional Vietnamese speech and naturally express emotions on the face while talking. Our face can represent lip movements while emotionally pronouncing Vietnamese words and, at the same time, show emotional facial expressions while speaking. To our knowledge, no such face has been proposed before. The face is built on two systems which we have presented previously: the rule-based system for synthesizing Vietnamese emotional speech presented in [8], and the system providing a mechanism for simulating continuous emotional states of conversational agents proposed in [9]. In addition, in this work there is an extra module for creating lip movements when pronouncing Vietnamese emotional speech and combining these movements with emotional facial expressions. We have tested the face in the football supporter domain in order to confirm its naturalness. The face is simulated as the face of a football supporter agent which experiences emotions and expresses them in his voice as well as on his face. The rest of the paper is organized as follows. First, we present the face's architecture in Section II; the construction and operation of the three main modules of the face are described in three subsections. We then test the face in the football supporter domain in Section III. Finally, the conclusion is presented in Section IV.
II. SYSTEM ARCHITECTURE

The talking face is built on the two systems that we have proposed before [8][9]. An overview of the face's architecture can be seen in Figure 1.

Figure 1: The face's architecture.

The face takes as input neutral speech with a corresponding phoneme list with temporal information and a series of Emotion State Vectors (ESV) over time. Ideally, one part of the input should
be text instead of neutral speech with a phoneme list. However, our work does not focus on text-to-speech; we only concentrate on synthesizing Vietnamese emotional speech from neutral speech. Therefore, we assume that there is a Vietnamese text-to-speech system which returns neutral speech with a corresponding phoneme list from text, and we use this as one part of the input for our system. There are three main modules in the talking face system: the Vietnamese Emotional Speech Synthesis (VESS) module, the Emotions to Facial Expressions (EFE) module, and the Combination module. The VESS module uses the system in [8] to convert Vietnamese neutral speech to Vietnamese emotional speech according to the corresponding emotional style. The EFE module uses the system in [9] to simulate continuous emotional facial expressions from the series of ESV. The Combination module creates lip movements when pronouncing Vietnamese emotional speech from the list of phonemes (with temporal information) and combines these movements with emotional facial expressions. Finally, the facial expressions and movements are displayed with synchronized emotional speech on a 3D face. In our system, we use the muscle-based 3D face model presented in [10]. This face model is able to produce both realistic facial expressions and real-time animation on standard personal computers. The construction and operation of the system's components are described in the next subsections.
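The overall data flow described above can be summarized in a short sketch. The module classes, method names, and data types below are hypothetical stand-ins used only to illustrate how the three modules of Figure 1 might be wired together; they are not taken from the actual implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical container types; field names are illustrative only.
@dataclass
class Phoneme:
    label: str       # e.g. "a", "b"
    start_ms: float  # temporal information from the (assumed) TTS front-end
    end_ms: float

@dataclass
class EmotionStateVector:      # ESV: one intensity per basic emotion
    intensities: List[float]

def talking_face_pipeline(neutral_speech, phonemes, esv_series,
                          vess, efe, combiner, face3d):
    """One hypothetical pass through the architecture of Figure 1."""
    # 1. VESS: neutral speech -> emotional speech in the target style.
    emotional_speech = vess.convert(neutral_speech, phonemes)

    # 2. EFE: series of ESV -> series of Facial Muscle Contraction Vectors.
    fmcv_series = efe.to_expressions(esv_series)

    # 3. Combination: lip movements from the phoneme list, merged with
    #    the emotional facial expressions.
    animation = combiner.combine(phonemes, fmcv_series)

    # 4. Display the animation on the 3D face, synchronized with the audio.
    face3d.play(animation, emotional_speech)
```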
A. Vietnamese Emotional Speech Synthesis (VESS) module
The VESS module uses our previously proposed work [8] to convert Vietnamese neutral speech to Vietnamese emotional speech according to the corresponding emotional style. The module takes Vietnamese neutral speech with a corresponding phoneme list as input and returns Vietnamese emotional speech as output. In [8], we presented a framework used to simulate four basic emotional styles of Vietnamese speech by means of acoustic feature conversion techniques applied to neutral utterances. Results of perceptual tests showed that the emotional styles were well recognized. The work in [8] is described in more detail in the following.
First, we carried out analyses of acoustic features of Vietnamese emotional speech in order to find the relations between acoustic feature variations and emotional states in Vietnamese speech. The analyses were performed on a speech database which consisted of Vietnamese utterances of 19 sentences, produced by one male and one female professional Vietnamese artist. The two actors were asked to produce utterances using five different styles: neutral, happiness, cold anger, sadness, and hot anger; each sentence had one utterance in each of the five styles, for each speaker. From the database, we then extracted acoustic cues related to emotional speech: the F0 contour, the power envelope, and the spectrum were calculated using STRAIGHT [11], while time duration was manually specified with the partial support of WaveSurfer [12].

At the utterance level, a total of 14 acoustic parameters were calculated and analysed; average pitch and average power at the syllable level were also examined. For each utterance, from the extracted F0 information, the highest pitch (HP), average pitch (AP), and pitch range (PR) were measured, and the average pitch of syllables was also examined. From the extracted power envelope, the considered acoustic parameters were maximum power (HPW), average power (APW), power range (PWR), and the average power of syllables. Next, for duration, the time segmentation of each utterance was manually measured first; the measurement included phoneme number, time (ms), and vowel. Then the durations of all phonemes, both consonants and vowels, as well as pauses were specified. From there, the mean of pause lengths (MPAU), total length (TL), consonant length (CL), and the ratio between consonant length and vowel length (RCV) were measured. Finally, from the extracted spectrum, the formants (F1, F2, F3) and spectral tilt (ST) were examined. Formant measures were taken approximately at the vowel midpoint using LPC order 12; spectral tilt was calculated from H1-A3, where H1 is the amplitude of the first harmonic and A3 is the amplitude of the strongest harmonic in the third formant.
After performing the above extraction phase, for each of the 190 utterances we had a set of 14 values corresponding to the 14 acoustic parameters at the utterance level. From these 190 sets, the values of variation coefficients with respect to the baseline (neutral style) were calculated. As a result, we had 152 sets of 14 variation coefficients: 19 sets for each of the four emotional styles, for each of the two speakers. After that, clustering was carried out on each pack of 19 sets, and the cluster containing the largest number of sets was chosen. Finally, from the chosen cluster, the mean values of the variation coefficients corresponding to the 14 parameters of each emotional style were calculated. At the syllable level, mean values of variation coefficients corresponding to the mean F0, average power, and mean duration of the syllables belonging to the word/compound word at the beginning and at the end of the sentence were also calculated. We were interested in these syllables because, when the emotional state changes, acoustic features vary more in some syllables of a phrase rather than changing uniformly over the whole phrase. When analyzing the database, we found that syllables belonging to the word/compound word at the beginning and at the end of the sentence varied more than other syllables. The mean values of variation coefficients at the utterance level as well as at the syllable level were used to form rules for converting neutral speech to emotional speech.
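As an illustration of this analysis step, the sketch below computes variation coefficients of the 14 utterance-level parameters relative to the neutral baseline and clusters the 19 coefficient sets of one emotional style, keeping the mean of the largest cluster. The use of k-means, the ratio-based coefficient, and the array layout are our own illustrative assumptions, not necessarily the exact procedure of [8].

```python
import numpy as np
from sklearn.cluster import KMeans

def style_variation_coefficients(emotional, neutral, n_clusters=3):
    """emotional, neutral: arrays of shape (19, 14) holding the 14
    utterance-level parameters for one speaker and one emotional style
    (and the neutral baseline). Returns the mean variation coefficients
    of the most populated cluster."""
    # Variation coefficient of each parameter w.r.t. the neutral baseline
    # (assumed here to be a simple ratio).
    coeffs = emotional / neutral                       # shape (19, 14)

    # Cluster the 19 coefficient sets and keep the largest cluster.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(coeffs)
    largest = np.bincount(km.labels_).argmax()
    chosen = coeffs[km.labels_ == largest]

    # The mean coefficients of the chosen cluster feed the conversion rules.
    return chosen.mean(axis=0)                         # shape (14,)
```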
In our work, we used a speech morphing technique to produce Vietnamese emotional speech. Our speech morphing process is presented in Figure 2. Firstly, STRAIGHT [11] was used to extract the F0 contour, power envelope, and spectrum of the neutral speech signal, while segmentation information was measured manually.

Figure 2: Speech morphing process using STRAIGHT.

Figure 3: Acoustic feature modification process.

Then, the acoustic features in terms of F0 contour, power envelope, spectrum, and duration were modified based on morphing rules inferred from the variation coefficients obtained in the analysis stage. These modifications were carried out taking into account the variations of acoustic features at the syllable level: syllables belonging to the words/compound words at the beginning and the end of the utterance are modified more. Finally, emotional speech is synthesized from the modified F0 contour, power envelope, spectrum, and duration using STRAIGHT. The modifications are carried out according to the flow presented in Figure 3.
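To make the morphing step concrete, the following sketch applies utterance-level variation coefficients to the F0 contour and power envelope extracted by STRAIGHT, with a stronger modification on the frames belonging to the first and last word of the utterance. The scaling scheme and the extra emphasis factor are illustrative assumptions; the actual rules in [8] are derived from the analysis described above.

```python
import numpy as np

def apply_morphing_rules(f0, power, coeff_f0, coeff_power,
                         boundary_mask, boundary_gain=1.2):
    """f0, power: per-frame contours extracted (e.g. by STRAIGHT) from the
    neutral speech. coeff_f0, coeff_power: variation coefficients of the
    target emotional style. boundary_mask: boolean array marking the frames
    of the first/last word of the utterance; boundary_gain is a hypothetical
    factor expressing that these syllables are modified more."""
    f0_mod = f0 * coeff_f0
    power_mod = power * coeff_power

    # Emphasize the modification on the boundary syllables.
    f0_mod[boundary_mask] = f0[boundary_mask] * coeff_f0 * boundary_gain
    power_mod[boundary_mask] = power[boundary_mask] * coeff_power * boundary_gain

    # The modified contours (together with the modified spectrum and
    # durations) would then be passed back to STRAIGHT for re-synthesis.
    return f0_mod, power_mod
```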
B. Emotions to Facial Expressions (EFE) module
The EFE module uses our proposed system in [9] to simulate continuous emotional facial expressions. The module takes a series of Emotion State Vectors (ESV) over time as input and returns a corresponding series of Facial Muscle Contraction Vectors (FMCV) as output. In [9], a scheme providing a mechanism to simulate continuous emotional states of a conversational agent was proposed based on the temporal patterns of the facial activities of six basic emotions. These temporal patterns were the results of an analysis of a spontaneous video database consisting of video sequences selected from three databases, namely MMI [13], FEEDTUM [14], and DISFA [15]. We used facial expression recognition techniques to analyze the database and then extracted the general temporal patterns for facial expressions of the six basic emotions. The analysis process was performed in four steps. First, for each frame of the input video, the Face Detector module used the Viola-Jones algorithm [16] to detect the face and return its location. Then the ASM Fitting module extracted feature points from the detected face using the ASM fitting algorithm [17]; an ASM shape of the face containing the locations of 68 feature points was returned. From this shape, the Face Normalization module carried out the normalizing task in order to set the shape to a common size (the distance between the centers of the eyes was used as the standard distance). Finally, the AUs Intensity Extractor module used the normalized feature point locations to calculate the intensity of the Action Units (AUs) related to the emotion style of the input video. (Action Units were defined by Ekman and Friesen, who developed the Facial Action Coding System (FACS) [18] to identify all possible visually distinguishable facial movements. FACS identifies the various facial muscles that, individually or in groups, cause changes in facial behaviors; these changes in the face and the underlying muscles that cause them are called Action Units (AUs). For each emotion, there is a set of related AUs that distinguishes it from the others.)
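The four analysis steps can be sketched as the per-frame loop below. The Viola-Jones detector is available in OpenCV; the ASM fitting and the AU intensity computation appear only as hypothetical helper callables (asm_fit, au_intensities), since the concrete models used in [9] are not part of a standard library, and the eye landmark indices follow the common 68-point convention as an assumption.

```python
import cv2
import numpy as np

def analyze_video(frames, asm_fit, au_intensities):
    """frames: iterable of BGR images. asm_fit is assumed to return a
    (68, 2) array of feature point coordinates for the detected face;
    au_intensities maps normalized points to per-frame AU intensities."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    series = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue                                  # skip frames with no face
        shape = asm_fit(gray, faces[0])               # ASM fitting on the detection

        # Normalize the shape so that the inter-ocular distance becomes 1.
        left_eye = shape[36:42].mean(axis=0)
        right_eye = shape[42:48].mean(axis=0)
        points = shape / np.linalg.norm(right_eye - left_eye)

        series.append(au_intensities(points))         # per-frame AU intensities
    return np.array(series)
```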
Figure 4: (a) Temporal pattern for facial expressions of happiness and sadness. (b) Temporal pattern for facial expressions of fear, anger, disgust, and surprise.

For a video of each emotion, we had a temporal series of intensity values for each AU. This series was then extracted and graphed. By observing these graphs and the videos, we formed the hypothesis that facial expressions happen in a series with decreasing intensity when a corresponding emotion is triggered. Thence, we proposed pre-defined temporal patterns for the facial expressions of the six basic emotions (Figure 4). In these patterns, the solid line part is always present while the dashed line part may be absent. As shown in the patterns, although the internal emotional state may keep a constant, sufficient intensity for a long time, the corresponding facial expressions are not always at the same intensity during that time. Rather, the facial expressions appear with an intensity corresponding to the intensity of the emotion, stay in this state for a while, and then fall back near the initial state. We call this process a cycle. We define a cycle of facial expressions as:

E = (P, Ts, Te, Do, Dr)

where P defines the target intensity of the expressions; Ts and Te are the starting time and the ending time of the cycle; and Do and Dr are the onset duration and offset duration of the expressions, respectively. The process in which the expressions occur in a cycle is described as a function of time,
where φ+ and φ− are the functions that describe the onset and offset phases of the expressions:

φ+(x, Do) = exp((ln 2 / Do) · x) − 1

φ−(x, Dr) = exp(ln 2 − ((ln 2 − ln(Pa + 1)) / Dr) · x) − 1
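A direct implementation of one expression cycle built from these onset and offset functions might look as follows. The way the two phases and the plateau are stitched together between Ts and Te is our reading of the cycle definition above, so the sketch should be treated as an illustration rather than the exact formulation of [9].

```python
import math

def phi_onset(x, d_onset):
    """Onset function: rises from 0 at x = 0 to 1 at x = d_onset."""
    return math.exp((math.log(2) / d_onset) * x) - 1

def phi_offset(x, d_offset, p_a):
    """Offset function: falls from 1 at x = 0 towards p_a at x = d_offset.
    p_a is read here as the residual (near-initial) intensity level."""
    rate = (math.log(2) - math.log(p_a + 1)) / d_offset
    return math.exp(math.log(2) - rate * x) - 1

def cycle_intensity(t, P, t_s, t_e, d_onset, d_offset, p_a=0.0):
    """Hypothetical intensity over time of one cycle E = (P, Ts, Te, Do, Dr)."""
    if t < t_s or t > t_e:
        return 0.0
    if t <= t_s + d_onset:                       # onset phase
        return P * phi_onset(t - t_s, d_onset)
    if t >= t_e - d_offset:                      # offset phase
        return P * phi_offset(t - (t_e - d_offset), d_offset, p_a)
    return P                                     # sustained (apex) phase
```

By construction, phi_onset reaches exactly 1 at x = Do and phi_offset starts at 1 and ends at Pa, matching the two formulas above.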
In order to verify the reasonableness of the pre-defined temporal patterns, we performed a fitting task on the temporal AU profiles. If the distance between the centers of the two eyes was normalized to 1, the sum of squares due to error (SSE) of the fit was 0.0207. Performing the fitting task for all temporal AU profiles, we found that the average SSE was 0.055, with a standard deviation of 0.078. These values showed that the above temporal patterns and the fitting function were reasonable.
Based on the temporal patterns, we proposed the scheme illustrated in Figure 5 to improve the conversion of the continuous emotional states of an agent into facial expressions. The idea is that facial expressions happen in a series with decreasing intensity when a corresponding emotion is triggered. For example, when an event triggers the happiness of a person, he/she will not smile at full intensity for as long as the happiness lasts; instead, he/she will express a series of smiles of decreasing intensity. Thus, emotional facial expressions appear only when there is a significant stimulus that changes the emotional state; otherwise, the expressions on the face are kept at a low level, displaying moods rather than emotions, even when the intensities of the emotions are high. Emotional expressions do not stay on the face for a long time, although emotions decay slowly; the expressions of moods, however, can last much longer on the face.

Figure 5: The scheme to convert continuous emotional states of an agent to facial expressions.
The Expression Mode Selection adjusts the series of ESV over time so that the corresponding facial expressions happen temporally in a way similar to the temporal patterns. This module determines whether an emotional facial expression should be generated to express the current emotional state, or whether the expressions on the 3D face should be kept at a low level, displaying moods rather than emotions. It first checks whether there is a significant increase in the intensity of any emotion i during the last Ti seconds (the duration of an emotional expression cycle), that is, whether

e_i^x − e_i^(x−1) > θ, with t − Ti ≤ x ≤ t,

where t is the current time and θ is the threshold to activate emotional facial expressions. (According to the analytic results on the video database, Ti has a value of about 3.5 for happiness, 5.3 for sadness, 3.6 for disgust, 3 for anger and fear, and 2.7 for surprise.) If there is a significant change, the ESV is converted directly to FMCV using the fuzzy rule based system proposed in [19], and cycle-tag_i is set to 1 for the happy and sad emotions and to 3 for the fear, anger, surprise, and disgust emotions. If not, the ESV is normalized as follows, where t'_i is the time at which the most recent cycle ends and t is the current time:

• if cycle-tag_i = 1 and t'_i + 3 ≤ t ≤ t'_i + 3 + Ti * 0.8, then e_i^t = e_i^t * 0.8 and cycle-tag_i = 2;

• if cycle-tag_i = 2 and t'_i + 3 ≤ t ≤ t'_i + 3 + Ti * 0.6, then e_i^t = e_i^t * 0.6 and cycle-tag_i = 3;

• otherwise, e_i^t is normalized to a lower intensity.

In this way, the emotions are displayed as moods, the low-intensity and long-lasting state of emotions. After being normalized, the ESV is converted to FMCV using the same fuzzy rule based system [19].
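The decision logic just described can be summarized in the following sketch. The data layout (a per-emotion history of intensities indexed by discrete time steps), the final "mood" scaling factor, and the helper standing in for the fuzzy rule based system of [19] are assumptions made for illustration only.

```python
# Per-emotion cycle durations Ti (seconds), from the analysis in [9].
T_CYCLE = {"happiness": 3.5, "sadness": 5.3, "disgust": 3.6,
           "anger": 3.0, "fear": 3.0, "surprise": 2.7}

def expression_mode_selection(history, emotion, t, theta,
                              cycle_tag, t_cycle_end, fuzzy_esv_to_fmcv):
    """history: dict mapping a discrete time step x -> intensity e_i^x of one
    emotion i. fuzzy_esv_to_fmcv is a stand-in for the fuzzy rule based
    system of [19]. Returns (fmcv, updated cycle_tag)."""
    Ti = T_CYCLE[emotion]

    # Significant increase of this emotion within the last Ti seconds?
    triggered = any(history[x] - history[x - 1] > theta
                    for x in history
                    if t - Ti <= x <= t and (x - 1) in history)

    if triggered:
        cycle_tag = 1 if emotion in ("happiness", "sadness") else 3
        return fuzzy_esv_to_fmcv(history[t]), cycle_tag

    # Otherwise normalize the intensity so the face shows a mood.
    e_t = history[t]
    if cycle_tag == 1 and t_cycle_end + 3 <= t <= t_cycle_end + 3 + Ti * 0.8:
        e_t, cycle_tag = e_t * 0.8, 2
    elif cycle_tag == 2 and t_cycle_end + 3 <= t <= t_cycle_end + 3 + Ti * 0.6:
        e_t, cycle_tag = e_t * 0.6, 3
    else:
        e_t *= 0.3        # assumed low-intensity "mood" level, illustrative only
    return fuzzy_esv_to_fmcv(e_t), cycle_tag
```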
C. Combination module
The Combination module creates lip movements when pronouncing Vietnamese emotional speech from the list of phonemes with temporal information and combines these movements with the emotional facial expressions from the EFE module.
Visemes for Vietnamese phonemes.
In order to create lip movements during Vietnamese speech, we first need a set of visemes for the face corresponding to the Vietnamese phonemes. Similar to our previous work [20], we follow the rules in [21] and [22] to specify the visemes corresponding to individual Vietnamese phonemes. According to [21], Vietnamese phonemes are divided into two types: vowels and consonants. Regarding the visemes of the vowels, these phonemes are separated and expressed according to three main factors: the position of the tongue, the open degree of the mouth, and the shape of the lips. By the openness of the mouth, the vowels are divided into four categories: close vowels (i), semi-close vowels (ê), semi-open vowels (e), and open vowels (a). The narrow-wide property of the vowels is specified by the gradual widening of the mouth. By the shape of the lips, the vowels are separated into two types: round-lip vowels (o, ô) and unround-lip vowels (ơ). The round or unround property is determined by the shape of the lips. Figure 6 shows the relationships between the vowels and these two properties: the horizontal lines express the open degree of the mouth, and the vertical lines express the shape of the lips, with the left part showing the unround-lip vowels and the right part the round-lip vowels. Regarding the visemes of the consonants, these phonemes are separated and expressed according to two main factors: where and how the phonemes are pronounced. According to the first factor, consonants are divided into three types: lip consonants (b, p, v, ph), tongue consonants (đ, ch, c, k), and fauces consonants (h).

Figure 6: Vowel trapezium.

Because the 3D face model [10] which we use simulates the effect of vector muscles, a sphincter muscle, and jaw rotation, it can display facial movements during Vietnamese speech. The open degree of the mouth corresponds to the amount of jaw rotation, and the round degree of the lips depends on the muscles which act on the lips. For simplicity, some vowels which are fairly similar are merged into one group. In order to create the vowel visemes, the amount of jaw rotation and the contraction degree of the muscles acting on the lips are first determined based on the vowel trapezium. After that, these values are manually refined by comparing the vowel visemes of the 3D face with the vowel visemes of a real human face. To create visemes for consonants, we consider only the positions where the phonemes are pronounced. According to this factor, we divided the consonants into three types: lip-lip consonants, lip-tooth consonants, and a last type containing the remaining consonants. We follow the rules in [21] and [22] to create the initial visemes for the consonants, and afterwards we also manually refine these as we do with the vowel visemes.
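In code, the vowel visemes can be stored as a small table of jaw rotation and lip rounding parameters derived from the vowel trapezium. The numeric values below are purely illustrative placeholders, since the actual values were tuned manually against a real human face.

```python
# Illustrative viseme table for Vietnamese vowel groups.
# jaw: open degree of the mouth (0 = closed, 1 = fully open)
# lip_round: contraction of the lip (sphincter) muscle (0 = spread, 1 = round)
# All numeric values are placeholders, not the manually refined ones.
VOWEL_VISEMES = {
    "i": {"jaw": 0.10, "lip_round": 0.0},   # close vowel
    "ê": {"jaw": 0.25, "lip_round": 0.0},   # semi-close vowel
    "e": {"jaw": 0.45, "lip_round": 0.0},   # semi-open vowel
    "a": {"jaw": 0.70, "lip_round": 0.0},   # open vowel
    "ơ": {"jaw": 0.40, "lip_round": 0.1},   # unround-lip vowel
    "ô": {"jaw": 0.35, "lip_round": 0.8},   # round-lip vowel
    "o": {"jaw": 0.50, "lip_round": 0.7},   # round-lip vowel
}

# Consonant visemes grouped only by the place of articulation.
CONSONANT_GROUPS = {
    "lip-lip":   ["b", "p"],
    "lip-tooth": ["v", "ph"],
    "other":     ["đ", "ch", "c", "k", "h"],
}
```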
Combination of lip movements when talking.
Human speech always comes as a paragraph, a sentence, or a few words. These consist of a set of phonemes, and groups of phonemes form words. For each single phoneme we already have a specific viseme. The task now is to make the movement from one viseme (e.g. V1) to another viseme (e.g. V2) gradual and smooth, in order to make the lip movement during speech realistic. The simplest way is to create intermediate visemes between V1 and V2 by adding V1's and V2's corresponding parameter values and taking the average. However, this is not a really good choice, because the articulation of a speech segment is not self-contained; it depends on the preceding and upcoming segments. In our approach, we apply the dominance model [23] to create the coarticulation effect on lip movements when talking. Coarticulation is the blending effect that surrounding phonemes have on the current phoneme. In [23], a lip movement corresponding to a speech segment is represented as a viseme segment. Each viseme segment has a dominance over the vocal articulators which increases and decreases over time during articulation. This dominance function specifies how close the lips come to reaching the target values of the viseme. A blending over time of the articulations is created by overlapping the dominance functions of adjacent movements corresponding to articulatory commands. Each movement has a set of dominance functions, one for each parameter. Different dominance functions can overlap for a given movement. The weighted average of all the co-occurring dominance functions produces the final lip shape.
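A minimal rendering of this blending idea is given below. It follows the commonly cited form of the Cohen-Massaro dominance function (a negative exponential in the time distance from the segment center); the parameter values and the exact function shape used in our face are not specified here, so this should be read as a generic sketch of [23] rather than our tuned implementation.

```python
import numpy as np

def dominance(t, center, alpha, theta, c=1.0):
    """Dominance of one viseme segment at time t: a negative exponential
    in the distance from the segment center (Cohen-Massaro style)."""
    return alpha * np.exp(-theta * np.abs(t - center) ** c)

def blended_lip_parameter(t, segments):
    """segments: list of dicts with keys 'center', 'target', 'alpha', 'theta'
    for one lip parameter. Returns the coarticulated value at time t as the
    dominance-weighted average of the per-viseme targets."""
    doms = np.array([dominance(t, s["center"], s["alpha"], s["theta"])
                     for s in segments])
    targets = np.array([s["target"] for s in segments])
    return float(np.sum(doms * targets) / np.sum(doms))
```

For example, a bilabial consonant whose closed-lip target has a high dominance will still pull the lips toward closure while a neighboring vowel is being articulated, which is exactly the coarticulation effect described above.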
Combination of emotional expressions and facial movements during speech.

In order to combine the emotional facial expressions (the output of the EFE module) with the above lip movements during Vietnamese speech, we apply the research proposed in [24]. There, facial movements are divided into groups called channels according to their type, such as emotion displays, lip movements when talking, etc. Schemes are then presented for the combination of movements within one channel and for the combination across different channels. These schemes resolve possibly conflicting muscles in order to eliminate unnatural facial animations. At a certain time, when there is a conflict between parameters in different animation channels, the parameters involved in the movement with the higher priority dominate the ones with the lower priority. In our talking face, we give higher priority to the lip movements when talking. The final facial animations resulting from the combination are displayed on the 3D talking face with synchronized synthesized emotional speech.
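The conflict resolution between channels can be expressed as a simple priority rule over the muscle parameters, as sketched below; the channel structure and parameter naming are assumptions for illustration, not the exact scheme of [24].

```python
def combine_channels(channels):
    """channels: list of (priority, params) pairs, where params maps a
    muscle/jaw parameter name to its contraction value for that channel
    (e.g. emotion display vs. lip movement). When two channels drive the
    same parameter, the higher-priority channel wins."""
    combined = {}
    best_priority = {}
    for priority, params in channels:
        for name, value in params.items():
            if name not in combined or priority > best_priority[name]:
                combined[name] = value           # higher-priority channel dominates
                best_priority[name] = priority
    return combined

# In our talking face the lip-movement channel would receive the higher
# priority, e.g.: combine_channels([(1, emotion_params), (2, lip_params)])
```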
III. EVALUATION
In order to test our talking face, we use ParleE, an emotion model for a conversational agent [25], and put the face in the football supporter domain [26]. ParleE is a quantitative, flexible, and adaptive model of emotions in which appraising events is based on learning and a probabilistic planning algorithm. ParleE also models personality and motivational states and their roles in determining the way the agent experiences emotion. This model was developed in order to enable conversational agents to respond to events with the appropriate expressions of emotions at different intensities. We put the face in the domain of a football supporter [26] because football is an emotional game; there are many events in the game that trigger emotions not only of the players but also of the coaches, supporters, etc. Testing the face in the football supporter domain gives us the chance to test many emotions as well as the dynamics of emotions, because the actions in a football match happen fast. Our talking face plays the role of the face of a football (soccer) supporter agent. The agent is watching a football match in which the team he supports is playing. The agent can experience different emotions by appraising events based on his goals, standards, and preferences. The emotions are then shown on the face and in the voice of our talking face. In short, the purpose of using ParleE and the football supporter domain is to provide good input to test our talking face.
Figure 7: A picture of the talking face.

Figure 7 shows a picture of our talking face. We have performed an experiment to collect an evaluation of its ability to express continuous emotional states. Following Katherine Isbister and Patrick Doyle [27], we selected the user test method for evaluating experiments related to emotions and facial expressions. To obtain the users' assessment, we showed them a clip of the talking face and then asked them to answer some questions. The face was tested with 20 users (10 males and 10 females) aged between 15 and 35, with an average age of 27 years. Each user test session took about 17 minutes. Sessions began with a brief introduction to the experiment process and the talking face. During the next 7 minutes, the user watched a short clip of the face. Finally, each user was interviewed separately about his/her assessment of the face. We asked a total of four questions, as shown in Figure 8.

Figure 8: Summary of interview results from the user test.

According to the users' assessments, the talking face was able to express emotions on the face and in the voice quite naturally.
IV. CONCLUSION
We have presented a three-dimensional face with the ability to speak emotional Vietnamese speech and naturally express emotions while talking. Our face can represent lip movements while emotionally pronouncing Vietnamese words and, at the same time, show emotional facial expressions while speaking. We have tested the face in the football supporter domain in order to confirm its naturalness. The face was simulated as the face of a football supporter which experiences emotions and expresses them in his voice as well as on his face. The experimental results show that our talking face is able to express emotions on the face and in the voice quite naturally.
REFERENCES

[1] R. Picard, Affective Computing. MIT Press, Cambridge, MA, 1997.
[2] D. H. Gelernter, The Muse in the Machine. Free Press, New York, 1994.
[3] A. R. Damasio, Descartes' Error: Emotion, Reason, and the Human Brain. G. P. Putnam, New York, 1994.
[4] F. Thomas and O. Johnston, The Illusion of Life. Abbeville Press, New York, 1981.
[5] C. Pelachaud, "Modelling multimodal expression of emotion in a virtual agent," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1535, pp. 3539–3548, 2009.
[6] M. C. Prasetyahadi, I. R. Ali, A. H. Basori, and N. Saari, "Eye, lip and crying expression for virtual human," International Journal of Interactive Digital Media, vol. 1(2), 2013.
[7] G. Collier, Emotional Expression. Lawrence Erlbaum Associates, New Jersey, 1985.
[8] T. D. Ngo, M. Akagi, and T. D. Bui, "Toward a rule-based synthesis of Vietnamese emotional speech," in Proc. of the Sixth International Conference on Knowledge and Systems Engineering (KSE 2014), pp. 129–142.
[9] T. D. Ngo, T. H. N. Vu, V. H. Nguyen, and T. D. Bui, "Improving simulation of continuous emotional facial expressions by analyzing videos of human facial activities," in Proc. of the 17th International Conference on Principles and Practice of Multi-Agent Systems (PRIMA 2014), pp. 222–237.
[10] T. D. Bui, D. Heylen, and A. Nijholt, "Improvements on a simple muscle-based 3D face for realistic facial expressions," in Proc. CASA 2003, 2003, pp. 33–40.
[11] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne, "Restructuring speech representations using a pitch adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds," Speech Communication, vol. 27, pp. 187–207, 1999.
[12] "WaveSurfer: http://www.speech.kth.se/wavesurfer/index.html."
[13] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, "Web-based database for facial expression analysis," in Proc. 13th ACM Int'l Conf. Multimedia and Expo, pp. 317–321, 2005.
[14] F. Wallhoff, "The facial expressions and emotions database homepage (FEEDTUM)," www.mmk.ei.tum.de/ waf/fgnet/feedtum.html.
[15] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn, "DISFA: A spontaneous facial action intensity database," IEEE Transactions on Affective Computing, vol. 4, no. 2, pp. 151–160, 2013.
[16] P. Viola and M. Jones, "Robust real-time object detection," Tech. Rep., Cambridge Research Laboratory Technical Report Series, no. 2, 2001.
[17] T. Cootes, C. Taylor, D. Cooper, and J. Graham, "Active shape models - their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38–59, 1995.
[18] P. Ekman and W. V. Friesen, Facial Action Coding System. Palo Alto, CA: Consulting Psychologists Press, 1978.
[19] T. D. Bui, D. Heylen, M. Poel, and A. Nijholt, "Generation of facial expressions from emotion using a fuzzy rule based system," in Australian Joint Conf. on Artificial Intelligence (AI 2001), Lecture Notes in Computer Science, Springer, Berlin, 2001, pp. 83–95.
[20] T. D. Ngo, N. L. Tran, Q. K. Le, C. H. Pham, and L. H. Bui, "An approach for building a Vietnamese talking face," Journal on Information and Communication Technologies, no. 6(26), 2011.
[21] X. T. Đỗ and H. T. Lê, Giáo trình tiếng Việt 2. Nhà xuất bản đại học Sư Phạm, 2007.
[22] T. L. Nguyễn and T. H. Nguyễn, Tiếng Việt (Ngữ âm và Phong cách học). Nhà xuất bản đại học Sư Phạm, 2007.
[23] M. M. Cohen and D. W. Massaro, "Modeling coarticulation in synthetic visual speech," in Models and Techniques in Computer Animation, pp. 139–156.
[24] T. D. Bui, D. Heylen, and A. Nijholt, "Combination of facial movements on a 3D talking head," in Proc. of the Computer Graphics International, 2004, pp. 284–290.
[25] ——, "ParleE: An adaptive plan-based event appraisal model of emotions," in Proc. KI 2002: Advances in Artificial Intelligence, pp. 129–143.
[26] ——, "Building embodied agents that experience and express emotions: A football supporter as an example," in Proceedings of the 17th Annual Conference on Computer Animation and Social Agents (CASA 2004), 2004.
[27] K. Isbister and P. Doyle, "Design and evaluation of embodied conversational agents: a proposed taxonomy," in Proceedings of the AAMAS 2002 Workshop on Embodied Conversational Agents: Let's Specify and Evaluate Them!, Bologna, Italy, 2002.