Realistic Face Animation for Speech

Gregor A. Kalberer, Computer Vision Group, ETH Zürich, Switzerland
kalberer@vision.ee.ethz.ch

Luc Van Gool, Computer Vision Group, ETH Zürich, Switzerland / ESAT-VISICS, Kath. Univ. Leuven, Belgium
vangool@vision.ee.ethz.ch

Keywords: face animation, speech, visemes, eigen space, realism

Abstract

Realistic face animation is especially hard as we are all experts in the perception and interpretation of face dynamics. One approach is to simulate facial anatomy. Alternatively, animation can be based on first observing the visible 3D dynamics, extracting the basic modes, and putting these together according to the required performance. This is the strategy followed by the paper, which focuses on speech. The approach follows a kind of bootstrap procedure. First, 3D shape statistics are learned from a talking face with a relatively small number of markers. A 3D reconstruction is produced at temporal intervals of 1/25 seconds. A topological mask of the lower half of the face is fitted to the motion. Principal component analysis (PCA) of the mask shapes reduces the dimension of the mask shape space. The result is two-fold. On the one hand, the face can be animated; in our case it can be made to speak new sentences. On the other hand, face dynamics can be tracked in 3D without markers for performance capture.

Introduction

Realistic face animation is a hard problem. Humans will typically focus on faces and are incredibly good at spotting the slightest glitch in the animation. On the other hand, there is probably no shape more important for animation than the human face. Several applications come immediately to mind, such as games, special effects for the movies, avatars, virtual assistants for information kiosks, etc. This paper focuses on the realistic animation of the mouth area for speech.

Face animation research dates back to the early 70's. Since then, the level of sophistication has increased dramatically. For example, the human face models used in Pixar's Toy Story had several thousand control points each [1]. Methods can be distinguished by mainly two criteria. On the one hand, there are image and 3D model based methods. The method proposed here uses 3D face models. On the other hand, the synthesis can be based on facial anatomy, i.e. both interior and exterior structures of a face can be brought to bear, or the synthesis can be purely based on the exterior shape. The proposed method only uses exterior shape. By now, several papers have appeared for each of these strands. A complete discussion is not possible, so the sequel rather focuses on a number of contributions that are particularly relevant for the method presented here.

So far, for reaching photorealism one of the most effective approaches has been the use of 2D morphing between photographic images [2, 3, 4]. These techniques typically require animators to specify carefully chosen feature correspondences between frames. Bregler et al. [5] used morphing of mouth regions to lip-synch existing video to a novel sound-track. This Video Rewrite approach works largely automatically and directly from speech. The principle is the re-ordering of existing video frames. It is of particular interest here as the focus is on detailed lip motions, incl. co-articulation effects between phonemes. But still, a problem with such 2D image morphing or re-ordering techniques is that they do not allow much freedom in the choice of face orientation or compositing the image with other 3D objects, two requirements of many animation applications.

In order to achieve such freedom, 3D techniques seem the most direct route. Chen et al. [6] applied 3D morphing between cylindrical laser scans of human heads. The animator must manually indicate a number of correspondences on every scan. Brand [7] generates full facial animations from expressive information in an audio track, but the results are not photo-realistic yet. Very realistic expressions have been achieved by Pighin et al. [8]. They present face animation for emotional expressions, based on linear morphs between 3D models acquired for the different expressions. The 3D models are created by matching a generic model to 3D points measured on an individual's face using photogrammetric techniques and interactively indicated correspondences. Though this approach is very convincing for expressions, it would be harder to implement for speech, where higher levels of geometric detail are required, certainly on the lips. Hai Tao et al. [9] applied 3D facial motion tracking based on a piece-wise Bézier volume deformation model and manually defined action units to track and subsequently synthesize visual speech. Also this approach is less convincing around the mouth, probably because only a few specific feature points are tracked and used for all the deformations. Per contra, L. Reveret et al. [10] have applied a sophisticated 3D lip model, which is represented as a parametric surface guided by 30 control points. Unfortunately the motion around the lips, which is also very important for increased realism, was tracked by only 30 markers on one side of the face and finally mirrored. Knowing that most people speak spatially asymmetrically, the chosen approach results in a very symmetric and not very detailed animation.

Here, we present a face animation approach that is based on the detailed analysis of 3D face shapes during speech. To that end, 3D reconstructions of faces have been generated at temporal sampling rates of 25 reconstructions per second. A PCA analysis on the displacements of a selection of control points yields a compact 3D description of visemes, the visual counterparts of phonemes. With 38 points on the lips themselves and a total of 124 on the larger part of the face that is influenced by speech, this analysis is quite detailed. By directly learning the facial deformations from real speech, their parameterisation in terms of principal components is a natural and perceptually relevant one. This seems less the case for anatomically based models [11, 12]. Concatenation of visemes yields realistic animations. In addition, the results yield a robust face tracker for performance capture, that works without special markers.

The structure of the paper is as follows. The first Section describes how the 3D face shapes that are observed during speech are acquired and how these data are used to analyse the space of corresponding face deformations. The second Section uses these results in the context of performance capture, and the third Section discusses the use for speech-based animation of a face for which 3D lip dynamics have been learned and for those to which the learned dynamics were copied. A last Section concludes the paper.

The Space of Face Shapes

Our performance capture and speech-based animation modules both make use of a compact parameterisation of real face deformations during speech. This section describes the extraction and analysis of the real, 3D input data.

Face Shape Acquisition

When acquiring 3D face data for speech, a first issue is the actual part of the face to be measured. The results of Munhall and Vatikiotis-Bateson [13] provide evidence that lip and jaw motions affect the entire facial structure below the eyes. Therefore, we extract 3D data for the area between the eyes and the chin, to which we fit a topological model or 'mask', as shown in fig. 1.

This mask consists of 124 vertices: the 34 standard MPEG-4 vertices and 90 additional vertices for increased realism. Of these vertices, 38 are on the lips and 86 are spread over the remaining part of the mask. The remainder of this section explores the shapes that this mask takes on if it is fitted to the face of a speaking person. The shape of a talking face was extracted at a temporal sampling rate of 25 3D snapshots per second (video). We have used Eyetronics' ShapeSnatcher system for this purpose [14]. It projects a grid onto the face, and extracts the 3D shape and texture from a single image. By using a video camera, a quick succession of 3D snapshots can be gathered. The ShapeSnatcher yields several thousand points for every snapshot, as a connected, triangulated and textured surface. The problem is that these 3D points correspond to projected grid intersections, not corresponding, physical points of the face. We have simplified the problem by putting markers on the face for each of the 124 mask vertices, as shown in fig. 2.


The 3D coordinates of these 124 markers (actually of the centroids of the marker dots) were measured for each 3D snapshot, through linear interpolation of the neighbouring grid intersection coordinates. This yielded 25 subsequent mask shapes for every second. One such mask fit is also shown in fig. 2. The markers were extracted automatically, except for the first snapshot, where the mask vertices were fitted manually to the markers. Thereafter, the fit of the previous frame was used as an initialisation for the next, and it was usually sufficient to move the mask vertices to the nearest markers. In cases where there were two nearby candidate markers, the situation could almost without exception be disambiguated by first aligning the vertices with only one such candidate.
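A minimal sketch of this frame-to-frame marker assignment, assuming the marker centroids of the current snapshot are given as a point cloud and the previous fit as an array of vertex positions; the array shapes, the ambiguity test and its threshold are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def fit_mask_to_markers(prev_vertices, markers, ambiguity_ratio=0.8):
    """Move each mask vertex to the nearest marker centroid of the new frame.

    prev_vertices: (124, 3) fitted vertex positions from the previous frame.
    markers:       (M, 3) marker centroids extracted from the current snapshot.
    Returns the updated vertices and the indices of ambiguous vertices
    (two comparably close candidate markers), left for a second pass.
    """
    new_vertices = prev_vertices.copy()
    ambiguous = []
    for i, v in enumerate(prev_vertices):
        d = np.linalg.norm(markers - v, axis=1)       # distances to all markers
        order = np.argsort(d)
        nearest, second = order[0], order[1]
        if d[nearest] > ambiguity_ratio * d[second]:  # two candidates at similar distance
            ambiguous.append(i)                       # resolve after the unambiguous ones
        else:
            new_vertices[i] = markers[nearest]
    return new_vertices, ambiguous
```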

Before the data were extracted, it had to be decided what the test person would say during the acquisition. It was important that all relevant visemes would be observed at least once, i.e. all visually distinct mouth shape patterns that occur during speech. Moreover, these different shapes should be observed in as short a time as possible, in order to keep processing time low. The subject was asked to pronounce a series of words, one directly after the other as in fluent speech, where each word was targeting one viseme. These words are given in the table of fig. 5. This table will be discussed in more detail later.

Face Shape Analysis

The 3D measurements yield different shapes of the mask during speech. A Principal Component Analysis (PCA) was applied to these shapes in order to extract the natural modes. The recorded data points represent 372 degrees of freedom (124 vertices with three displacements each). Because only 145 3D snapshots were used for training, at most 144 components could be found. This poses no problem as 98% of the total variance was found to be represented by the first 10 components or 'eigenmasks', i.e. the eigenvectors with the 10 highest eigenvalues of the covariance matrix for the displacements. This leads to a compact, low-dimensional representation in terms of eigenmasks. It has to be added that so far we have experimented with the face of a single person. Work on automatically animating faces of people for whom no dynamic 3D face data are available is planned for the near future. Next, we describe the extraction of the eigenmasks in more detail.

The extraction of the eigenmasks follows traditional PCA, applied to the displacements of the 124 selected points on the face. This analysis cannot be performed on the raw data, however. First, the mask position is normalised with respect to the rigid rotation and translation of the head. This normalisation is carried out by aligning the points that are not affected by speech, such as the points on the upper side of the nose and the corners of the eyes. After this normalisation, the 3D positions of the mask vertices are collected into a single vector $m_k$ for every frame $k = 1 \dots N$, with $N = 145$ in this case:

$$m_k = (x_{k,1}, y_{k,1}, z_{k,1}, \dots, x_{k,124}, y_{k,124}, z_{k,124})^T \qquad (1)$$

where $T$ stands for the transpose. Then, the average mask $\bar{m}$,

$$\bar{m} = \frac{1}{N} \sum_{k=1}^{N} m_k, \qquad (2)$$

is subtracted to obtain displacements with respect to the average, denoted as $\Delta m_k = m_k - \bar{m}$. The covariance matrix $\Sigma$ for the displacements is obtained as

$$\Sigma = \frac{1}{N} \sum_{k=1}^{N} \Delta m_k \, \Delta m_k^T. \qquad (3)$$

Writing its eigendecomposition as

$$\Sigma = R \, \Lambda \, R^T, \qquad (4)$$

one obtains the PCA decomposition, with $\Lambda$ the diagonal scaling matrix with the eigenvalues $\lambda$ sorted from the largest to the smallest magnitude, and the columns of the rotation matrix $R$ the corresponding eigenvectors. The eigenvectors with the highest eigenvalues characterize the most important modes of face deformation. Mask shapes can be approximated as a linear combination of the 144 modes:

$$m_j \approx \bar{m} + R \, w_j. \qquad (5)$$

The weight vector $w_j$ describes the deviation of the mask shape $m_j$ from the average mask $\bar{m}$ in terms of the eigenvectors, coined eigenmasks for this application. By varying $w_j$ within reasonable bounds, realistic mask shapes are generated. As already mentioned at the beginning of this section, it was found that most of the variance (98%) is represented by the first 10 modes, hence further use of the eigenmasks is limited to linear combinations of the first 10. They are shown in fig. 3.
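The eigenmask computation amounts to standard PCA on the aligned mask vectors. The following sketch assumes the 145 aligned mask shapes are stacked row-wise in a (145, 372) array; the variable names and the explained-variance check are illustrative, not the authors' code:

```python
import numpy as np

def compute_eigenmasks(masks, n_keep=10):
    """PCA on mask shape vectors.

    masks: (N, 372) array, one row per frame, each row the flattened
           (x, y, z) coordinates of the 124 mask vertices after rigid alignment.
    Returns the mean mask, the first n_keep eigenmasks, their eigenvalues,
    and the fraction of variance they explain.
    """
    mean_mask = masks.mean(axis=0)                      # \bar{m}
    deltas = masks - mean_mask                          # \Delta m_k
    cov = deltas.T @ deltas / masks.shape[0]            # covariance of displacements
    eigvals, eigvecs = np.linalg.eigh(cov)              # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]                   # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals[:n_keep].sum() / eigvals.sum()  # about 0.98 for n_keep = 10 here
    return mean_mask, eigvecs[:, :n_keep], eigvals[:n_keep], explained

def project(mask, mean_mask, eigenmasks):
    """Weight vector w of a mask shape in the eigenmask basis."""
    return eigenmasks.T @ (mask - mean_mask)

def reconstruct(w, mean_mask, eigenmasks):
    """Mask shape approximated from a low-dimensional weight vector."""
    return mean_mask + eigenmasks @ w
```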


Performance Capture

A face tracker has been developed that can serve as a performance capture system for speech. It fits the face mask to subsequent 3D snapshots, but now without markers. Again, 3D snapshots taken with the ShapeSnatcher at 1/25 second intervals are the input. The face tracker decomposes the 3D motions into rigid motions and motions due to the visemes.

The tracker first adjusts the rigid head motion and then adapts the weight vector $w_j$ to fit the remaining motions, mainly those of the lips. A schematic overview is given in fig. 4(a). Such performance capture can e.g. be used to drive a face model at a remote location, by only transmitting a few face animation parameters: 6 parameters for rigid motion and 10 components of the weight vectors.
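To illustrate how compact such a remote-animation stream is, the per-frame state reduces to 16 numbers. The byte layout below is a hypothetical packing, not a format specified by the paper:

```python
import struct

def pack_frame(rotation, translation, weights):
    """Pack one frame of animation parameters: 3 rotation angles,
    3 translation components and the 10 eigenmask weights (16 floats, 64 bytes)."""
    assert len(rotation) == 3 and len(translation) == 3 and len(weights) == 10
    return struct.pack("<16f", *rotation, *translation, *weights)

def unpack_frame(payload):
    """Inverse of pack_frame; returns (rotation, translation, weights)."""
    values = struct.unpack("<16f", payload)
    return values[:3], values[3:6], values[6:]
```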

For the very first frame, the system has no clue where the face is and where to try fitting the mask. In this special case, it starts by detecting the nose tip. It is found as a point with particularly high curvature in both horizontal and vertical direction:

$$n(x, y) = \{(x, y) \;|\; \min(\max(0, k_x), \max(0, k_y)) \text{ is maximal}\} \qquad (6)$$

where $k_x$ and $k_y$ are the two curvatures, which are in fact averaged over a small region around the points in order to reduce the influence of noise. The curvatures are extracted from the 3D face data obtained with the ShapeSnatcher. After the nose tip vertex of the mask has been aligned with the nose tip detected on the face, and with the mask oriented upright, the rigid transformation can be fixed by aligning the upper part of the mask with the corresponding part of the face. After the first frame, the previous position of the mask is normally close enough to directly home in on the new position with the rigid motion adjustment routine alone.
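A possible implementation of this detector, under the assumption that per-vertex horizontal and vertical curvature estimates and the local neighbourhoods used for averaging are already available (their computation is outside this sketch):

```python
import numpy as np

def detect_nose_tip(vertices, kx, ky, neighbours):
    """Find the vertex maximising min(max(0, kx), max(0, ky)), cf. eq. (6).

    vertices:   (V, 3) mesh vertex positions.
    kx, ky:     (V,) horizontal and vertical curvature estimates per vertex.
    neighbours: list of index arrays, the local region over which the
                curvatures are averaged to suppress noise.
    """
    kx_s = np.array([kx[nb].mean() for nb in neighbours])  # locally averaged curvatures
    ky_s = np.array([ky[nb].mean() for nb in neighbours])
    score = np.minimum(np.maximum(kx_s, 0.0), np.maximum(ky_s, 0.0))
    return vertices[np.argmax(score)]
```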

The rigid motion adjustment routine focuses on the upper part of the mask, as this part hardly deforms during speech. The alignment is achieved by minimizing distances between the vertices of this part of the mask and the face surface. In order not to spend too much time on extracting the true distances, the cost $E_o$ of a match is simplified. Instead, the distances are summed between the mask vertices $x$ and the points $p$ where lines through these vertices and parallel to the viewing direction of the 3D acquisition system hit the 3D face surface:


$$E_o = \sum_{i \in \{\text{upper part}\}} d_i \,, \qquad d_i = \| p_i - x_i(w) \| \,. \qquad (7)$$

Note that the sum is only over the vertices in the upper part of the mask. The optimization is performed with the downhill simplex method [15], with 3 rotation angles and 3 translation components as parameters. Fig. 4 gives an example where the mask starts from an initial position (b) and is iteratively rotated and translated to end up in the rigidly adjusted position (c).
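A sketch of this rigid adjustment, using SciPy's Nelder-Mead implementation of the downhill simplex method; the `surface_distance` callable, which would implement the line-of-sight distance to the scanned surface, is assumed to be provided:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def rigid_cost(params, upper_vertices, surface_distance):
    """Cost E_o of eq. (7) for a candidate rigid motion.

    params:           3 rotation angles (radians) followed by 3 translations.
    upper_vertices:   (K, 3) mask vertices of the rigid upper face part.
    surface_distance: callable mapping (K, 3) points to their distances to the
                      scanned face surface (measured along the viewing direction
                      of the acquisition system in the paper).
    """
    R = Rotation.from_euler("xyz", params[:3]).as_matrix()
    moved = upper_vertices @ R.T + params[3:]
    return surface_distance(moved).sum()

def adjust_rigid_motion(upper_vertices, surface_distance, init=np.zeros(6)):
    """Downhill simplex (Nelder-Mead) over the 6 rigid parameters."""
    res = minimize(rigid_cost, init, args=(upper_vertices, surface_distance),
                   method="Nelder-Mead")
    return res.x
```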

Once the rigid motion has been canceled out, a fine-registration step deforms the mask in order to precisely fit the instantaneous 3D facial data due to speech. To that end the components of the weight vector $w$ are optimised. Just as is the case with face spaces [16], PCA also here brings the advantage that the dimensionality of the search space is kept low. Again, a downhill simplex procedure is used to minimize a cost function for subsequent frames $j$. This cost function is of the same form as eq. (7), with the difference that now the distance for all mask vertices is taken into account (i.e. also for the non-rigidly moving parts). Each time starting from the previous weight vector $w_{j-1}$ (for the first frame starting with the average mask shape, i.e. $w_{j-1} = 0$), an updated vector $w_j$ is calculated for the frame at hand. These weight vectors have dimension 10, as only the eigenmasks with the 10 largest eigenvalues are considered (see the section on face shape analysis). Fig. 4(d) shows the fine registration for this example.
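The fine registration can be sketched in the same way, now optimising the 10 eigenmask weights instead of the rigid parameters; `snapshot_distance` is again an assumed callable summing vertex-to-surface distances over all mask vertices:

```python
import numpy as np
from scipy.optimize import minimize

def fine_register(snapshot_distance, mean_mask, eigenmasks, w_prev):
    """Optimise the 10 eigenmask weights so the deformed mask hugs the scan.

    snapshot_distance: callable mapping a (124, 3) mask to the summed distances
                       between all mask vertices and the 3D snapshot surface.
    mean_mask:         flattened average mask, shape (372,).
    eigenmasks:        (372, 10) eigenmask basis.
    w_prev:            weight vector of the previous frame, used as the start
                       (zeros for the very first frame).
    """
    def cost(w):
        mask = (mean_mask + eigenmasks @ w).reshape(-1, 3)
        return snapshot_distance(mask)

    res = minimize(cost, w_prev, method="Nelder-Mead")
    return res.x
```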

The sequence of weight vectors – i.e. mask shapes – extracted in this way can be used as a performance capture result, to animate the face and reproduce the original motion. This reproduced motion still contains some jitter, due to sudden changes in the values of the weight vector's components. Therefore, these components are smoothed with B-splines (of degree 3). These smoothed mask deformations are used to drive a detailed 3D face model, which has many more vertices than the mask. For the animation of the face vertices between the mask vertices a lattice deformation was used (Maya, deformer type 'wrap').
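A possible form of this smoothing step with SciPy's cubic B-splines; the smoothing factor is a guess, as the paper only states that degree-3 B-splines were used:

```python
import numpy as np
from scipy.interpolate import splrep, splev

def smooth_weight_tracks(weights, fps=25.0, smoothing=0.5):
    """Smooth each of the 10 weight components over time with a cubic B-spline.

    weights:   (F, 10) tracked weight vectors, one row per frame.
    smoothing: splrep smoothing factor; an assumed value, tune to the jitter level.
    Returns the smoothed (F, 10) trajectories, resampled at the original frames.
    """
    t = np.arange(weights.shape[0]) / fps
    smoothed = np.empty_like(weights)
    for c in range(weights.shape[1]):
        tck = splrep(t, weights[:, c], k=3, s=smoothing)  # cubic smoothing spline
        smoothed[:, c] = splev(t, tck)
    return smoothed
```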

Fig. 8 shows some results. The first row (A) shows different frames of the input video sequence. The person says "Hello, my name is Jlona". The second row (B) shows the 3D ShapeSnatcher output, i.e. the input for the performance capture. The third row (C) shows the extracted mask shapes for the same time instances. The fourth row (D) shows the reproduced expressions of the detailed face model as driven by the tracker.

Animation

The use of performance capture is limited, as it only allows a verbatim replay of what has been observed. This limitation can be lifted if one can animate faces based on speech input, either as an audio track or text. Our system deals with both types of input.

Animation of speech has much in common with speech synthesis. Rather than composing a sequence of phonemes according to the laws of co-articulation to get the transitions between the phonemes right, the animation generates sequences of visemes. Visemes correspond to the basic, visual mouth expressions that are observed in speech. Whereas there is a reasonably strong consensus about the set of phonemes, there is less unanimity about the selection of visemes. Approaches aimed at realistic animation of speech have used any number from as few as 16 [2] up to about 50 visemes [17]. This number is by no means the only parameter in assessing the level of sophistication of different schemes. Much also depends on the addition of co-articulation effects. There certainly is no simple one-to-one relation between the 52 phonemes and the visemes, as different sounds may look the same and therefore this mapping is rather many-to-one. For instance, \b\ and \p\ are two bilabial stops which differ only in the fact that the former is voiced while the latter is voiceless. Visually, there is hardly any difference in fluent speech.

We based our selection of visemes on the work of Owens [18] for consonants. We use his consonant groups, except for two of them, which we combine into a single \k,g,n,l,ng,h,y\ viseme. The groups are considered as single visemes because they yield the same visual impression when uttered. We do not consider all the possible instances of different, neighboring vocals that Owens distinguishes, however. In fact, we only consider two cases for each cluster: rounded and widened, that represent the instances farthest from the neutral expression. For instance, the viseme associated with \m\ differs depending on whether the speaker is uttering the sequence 'omo' or 'umu' vs. the sequence 'eme' or 'imi'. In the former case, the \m\ viseme assumes a rounded shape, while the latter assumes a more widened shape. Therefore, each consonant was assigned to these two types of visemes. For the visemes that correspond to vocals, we used those proposed by Montgomery and Jackson [19].

As shown in fig. 5, the selection contains a total of 20 visemes: 12 representing the consonants (boxes with red 'consonant' title), 7 representing the monophthongs (boxes with title 'monophthong') and one representing the neutral pose (box with title 'silence'), where diphthongs (box with title 'diphthong') are divided into two separate monophthongs and their mutual influence is taken care of as a co-articulation effect. The boxes with smaller title 'allophones' can be discarded by the reader for the moment. The table also contains example words producing the visemes when they are pronounced.

This viseme selection differs from others proposed earlier. It contains more consonant visemes than most, mainly because the distinction between the rounded and widened shapes is made systematically. For the sake of comparison, Ezzat and Poggio [2] used 6 (only one for each of Owens' consonant groups, while also combining two of them), Bregler et al. [5] used 10 (same clusters, but they subdivided the cluster \t,d,s,z,th,dh\ into \th,dh\ and the rest, and \k,g,n,l,ng,h,y\ into \ng\, \h\, \y\, and the rest, making an even more precise subdivision for this cluster), and Massaro [20] used 9 (but this animation was restricted to cartoon-like figures, which do not show the same complexity as real faces). We feel that our selection is a good compromise between the number of visemes needed in the animation and the realism that is obtained.

Animation can then be considered as navigating through a graph where each node represents one of $N_V$ visemes, and the interconnections between nodes represent the $N_V^2$ viseme transformations (co-articulation). From an animator's perspective, the visemes represent key masks, and the transformations represent a method of interpolating between them. As a preparation for the animation, the visemes were mapped into the 10-dimensional eigenmask space. This yields one weight vector $w_{vis}$ for every viseme. The advantage of performing the animation as transitions between these points in the eigenmask space is that interpolated shapes all look realistic. As was the case for tracking, point to point navigation in the eigenmask space as a way of concatenating visemes yields jerky motions. Moreover, when generating the temporal samples, these may not precisely coincide with the pace at which visemes change. Both problems are solved through B-spline fitting to the different components of the weight vectors $w_{vis}(t)$, with $t$ time, as illustrated in fig. 6, which yields trajectories that are smooth and that can be sampled as desired.
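A sketch of this keyframe concatenation, interpolating the per-viseme weight vectors with a cubic B-spline and resampling at the video rate; the function and its arguments are illustrative, and at least four keyframes are needed for a cubic fit:

```python
import numpy as np
from scipy.interpolate import make_interp_spline

def viseme_trajectory(times, w_keyframes, fps=25.0):
    """Turn a timed sequence of viseme weight vectors into smooth mask weights.

    times:       (K,) strictly increasing keyframe times in seconds,
                 one per viseme in the utterance (K >= 4 for a cubic spline).
    w_keyframes: (K, 10) eigenmask weight vector of each viseme.
    Returns the frame times and the resampled (F, 10) weight trajectory.
    """
    spline = make_interp_spline(times, w_keyframes, k=3, axis=0)  # cubic B-spline per component
    frame_times = np.arange(times[0], times[-1], 1.0 / fps)
    return frame_times, spline(frame_times)
```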

As input for the animation experiments we have used both text and audio. The visemes which have to be visited, the order in which this should happen, and the time intervals in between can be calculated from a pure audio track containing speech. First a file is generated that contains the ordered list of allophones and their timing. 'Allophones' correspond to a finer subdivision of phonemes. This transcription has not been our work and we have used an existing tool, described in [21]. The allophones are then translated into visemes. The vocals and 'silence' are directly mapped to the viseme in the box immediately to their left in fig. 5. For the consonants the context plays a role. If they immediately follow a vocal among \o\, \u\, and \@@\ (this is the vocal as in 'bird'), then the allophone is mapped onto a rounded consonant (the corresponding box in the left column of fig. 5). If the vocal is among \i\, \a\, and \e\, then the allophone is mapped onto a widened consonant (the corresponding box in the right column of fig. 5). When the consonant is not preceded immediately by a vocal, but the subsequent allophone is one, then a similar decision is made. If the consonant is flanked by two other consonants, the preceding vocal decides.
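The context rule for consonants can be summarised in a few lines; the vocal sets are reduced to the ones named above, and the final fallback is an assumption, since the paper does not state what happens when no vocal precedes at all:

```python
ROUNDING_VOCALS = {"o", "u", "@@"}   # vocals that trigger the rounded consonant shape
WIDENING_VOCALS = {"i", "a", "e"}    # vocals that trigger the widened consonant shape

def consonant_shape(allophones, index):
    """Decide 'rounded' vs. 'widened' for the consonant at allophones[index],
    following the context rules described above. The vocal sets and the
    is_vocal() helper are simplified placeholders for a full allophone inventory."""
    def is_vocal(a):
        return a in ROUNDING_VOCALS or a in WIDENING_VOCALS

    def shape_from(vocal):
        return "rounded" if vocal in ROUNDING_VOCALS else "widened"

    prev_a = allophones[index - 1] if index > 0 else None
    next_a = allophones[index + 1] if index + 1 < len(allophones) else None

    if prev_a is not None and is_vocal(prev_a):      # directly after a vocal
        return shape_from(prev_a)
    if next_a is not None and is_vocal(next_a):      # directly before a vocal
        return shape_from(next_a)
    for a in reversed(allophones[:index]):           # flanked by consonants:
        if is_vocal(a):                              # the preceding vocal decides
            return shape_from(a)
    return "widened"                                 # fallback, not specified in the paper
```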

Once the sequence of visemes and their timing are available, the mask deformations are determined. The mask then drives the detailed face model. Fig. 8 (E) shows a few snapshots of the animated head model, for the same sentence as used for the performance capture example. Row (F) shows a detail of the lips for another viewing angle.

It is of course interesting at this point to test what the result would be of verbatim copying of the visemes onto another face. If successful, this would mean that no new lip dynamics have to be captured for that face and much time and effort could be saved. Such results are shown in fig. 7. Although these static images seem reasonable, the corresponding sequences are not really satisfactory.

Conclusions

Realistic face animation is still a hard nut to crack. We have tried to attack this problem via the acquisition and analysis of exterior, 3D face measurements. With 38 points on the lips alone and a further 86 around the mouth region to cover all parts of the face that are influenced by speech, it seems that this analysis is more detailed than earlier ones. Based on a proposed selection of visemes, speech animation is approached as the concatenation of 3D mask deformations, expressed in a compact space of 'eigenmasks'. Such an approach was also demonstrated for performance capture.

This work still has to be extended in a number of ways. First, the current animation suite only supports animation of the face of the person for whom the 3D snapshots were acquired. Although we have tried to transplant visemes onto other people's faces, it became clear that a really realistic animation requires visemes that are adapted to the shape or 'physiognomy' of the face at hand. Hence one cannot simply copy the deformations that have been extracted from one face to a novel face. It is not precisely known at this point how the viseme deformations depend on the physiognomy, but ongoing experiments have already shown that adaptations are possible without a complete relearning of the face dynamics.

Secondly, there are still unnatural effects with some co-articulations between subsequent consonants. Although Massaro [22] has suggested to use a finite inventory of visemes rather than an approach with a huge amount of disemes, these effects have to be removed through the refinement of the spline trajectories in the eigenmask space and a more sophisticated dominance model in general. Other necessary improvements are the rounding of the lips into the mouth cavity, which is not yet present because these parts of the lips are not observed in the 3D data (the reason why the mouth doesn't close completely yet), and the addition of wrinkles on the lips and elsewhere, which can be solved by also using the dynamically observed texture data (e.g. when the lips are rounded).

Acknowledgments

This research work has been supported by the ETH Research Council and by the European Commission through the IST project MESH (www.meshproject.com 2002) and with the assistance of: Univ
