
VIETNAM NATIONAL UNIVERSITY, HANOI

COLLEGE OF TECHNOLOGY

ANH DUC NGUYEN

IMPROVING THE 3D TALKING HEAD

FOR USING IN AN AVATAR

OF VIRTUAL MEETING ROOM

Branch: Information Technology

Code: 1.01.10

MASTER THESIS

Supervisor: Dr The Duy Bui

Hanoi, November 2006


List of Figures 3

Chapter 1 - Introduction 5

1.1 The avatar in the virtual meeting room 5

1.2 Structure of this thesis 6

Chapter 2 - The 3D animated talking head 8

2.1 A muscle based 3D face model 8

2.2 Combination of facial movements on a 3D talking head 9

2.3 From emotions to emotional facial expressions 12

2.4 Conclusion 15

Chapter 3 - OpenGL and JOGL overview 16

3.1 OpenGL overview 16

3.1.1 Immediate Mode and Retained Mode (Scene Graphs) 16

3.1.2 OpenGL history 16

3.1.3 How does OpenGL work? 17

3.1.4 OpenGL as a state machine 19

3.1.5 Drawing geometry 20

3.2 JOGL overview 22

3.2.1 Introduction 22

3.2.2 Developing with JOGL 23

3.2.3 Using JOGL 24

3.3 Conclusion 25

Chapter 4 - Improving lip-sync ability 26

4.1 Introduction 26

4.2 Previous work 27

4.3 FreeTTS and Mbrola 28

4.3.1 FreeTTS 28

4.3.2 Mbrola 31

4.4 The improved lip model 32

4.5 Conclusion 35

Chapter 5 - Adding the hair and eyelashes models 36

5.1 Introduction 36


5.2 The Hair model 37

5.2.1 Introduction to VRML 37

5.2.2 Our hair model 39

5.3 The Eyelashes model 42

5.4 Conclusion 44

Chapter 6 - Implementation and illustrations 45

6.1 Implementing the face model 45

6.1.1 Structure of the system 45

6.1.2 Some improvements 46

6.2 Face model illustrations 47

Chapter 7 - Conclusion 56

Future research 56

References 58


List of Figures

2.1: The original 3D face model: (a): The face mesh with muscles; (b): The face after rendering 9

2.2: System overview 10

2.3: Combination of two movements in the same channel 11

2.4: The activity of Zygomatic Major and Orbicularis Oris before (top) and after (bottom) applying combination algorithm 11

2.5: The emotion-to-expression system 12

2.6: Membership functions for emotion intensity (a) and muscle contraction level (b) 13

2.7: Basic emotions: neutral, Sadness, Happiness, Anger, Fear, Disgust, Surprise (from left to right) 15

3.1: Software implementation of OpenGL 18

3.2: Hardware implementation of OpenGL 18

3.3: A simplified version of OpenGL pipeline 19

3.4: The structure of an application using JOGL 25

4.1: FreeTTS Architecture 29

5.1: Dividing a polygon (a) to triangles (b) 40

5.2: Importing the hair model: (a): the original head; (b): the head with the imported hair model; (c): the head with the imported and fine tuned hair model 41

5.3: Some other imported and fine tuned hair models 41

5.4: The open (a) and close eyes (b) without and with eyelashes 43

5.5: The face without (a) and with (b), (c) the hair and eyelashes models 44

6.1: The main interface of our program 47

6.2: The face model displays Happiness emotion with maximum intensity 48

6.3: The face model displays Surprise emotion with maximum intensity 48


6.4: The combination of two emotions: Happiness and Surprise 49

6.5: The effect of left Zygomatic Major muscle’s contraction at maximum level on the face model 49

6.6: The face model from different view points 50

6.7: Increasing surprise 50

6.8: The hair model after being imported 51

6.9: The hair model after being fine tuned 51

6.10: Some other hair models 52

6.11: Closing the eyes 53

6.12: The face model attached to the body 54

6.13: Our face model embedded into another project 54


Chapter 1

Introduction

1.1 The avatar in the virtual meeting room

Virtual Meeting Rooms (VMRs) are 3D virtual simulations of meeting rooms in which various modalities such as speech, gaze, distance, gestures and facial expressions can be controlled (a VMR project in Twente). The rapid development of computer graphics and embodied conversational agents allows the creation of VMRs and makes them useful for various purposes. These purposes can be divided into the following three categories [24]. First, they can be used as a virtual environment for teleconferencing, a real-time means of communication for remote participation in meetings [18]. Using VMRs helps to reduce the amount of data that needs to be sent to and displayed on the screens of remote clients. In addition, they help to overcome some features that are problematic in real meetings or in traditional video-based conferences. For example, participants can adapt the Virtual Environment to their own preferences without disturbing other people, or they can choose a view from any seat in the VMR that they want and feel comfortable during the meeting [17]. Second, VMRs are used to simulate the content of a recorded meeting in different ways or to present multimedia information about it. Information can be recorded directly from participants' behavior in real meetings (e.g. tracking of head or body movements, voice). These presentations can be used as a 3D summary of the real meeting or for evaluating the annotations and results obtained by machine learning methods. Third, because Virtual Environments allow controlling various independent factors (voice, gaze, distance, gestures, and facial expressions), these factors can be used to study their influence on features of social interaction and social behavior. Conversely, the effect of social interaction on these factors can be studied adequately in Virtual Environments as well.

In the VMR environment, each participant is represented by an avatar. An avatar is an embodied conversational agent that simulates all behaviors and movements of the participant. The avatar will typically contain a talking head, which is able to speak and display lip movements during speech, emotional facial expressions and conversational signals, and a body, which is able to display the gestures of the participant. The important point is that the avatar of each participant must be believable to the other participants. The avatar will be believable if it can simulate the appearance and express the characteristics of the participant, and if its actions and reactions are as true to life as those of the person it is representing.

The talking head model plays an important role in the creation of a believable avatar. It is not only used to display facial movements and expressions but also to distinguish one avatar from another and to express the personality of the participant. In order to create a talking head model that is suitable for use as an avatar in VMRs, there are several problems to deal with. First, the talking head must be simple enough to allow real-time animation but still produce realistic, high-quality facial expressions. Second, the talking head must not only have the capability to create facial movements such as conversational signals, emotional expressions, etc., but must also combine them and resolve the conflicts between them. Third, the talking head must look like a real head, which means it must have other models attached to it such as a hair model, a tongue model, an eyelashes model, etc.

In this thesis, we choose the talking head model from [3] to improve and then use for avatars in VMRs. We study the model carefully to discover its advantages as well as its disadvantages. The advantages are inherited, while the missing functions and disadvantages are supplemented or improved, respectively. We change the rendering method of the head to a new one to improve the animation speed. The synchronization between audible and visible speech is also improved. We supply hair and eyelashes models to make the head look more realistic. The improved model can not only be used for avatars in the VMR environment but can also be embedded into other projects.

1.2 Structure of this thesis

In Chapter 2, we introduce the 3D animated talking head [3] that our work is based on. This head is able to produce realistic facial expressions with real-time animation on a personal computer. It can display several types of facial movements such as eye blinking, head rotation, lip movement, etc. at once, and most importantly, it can generate emotional facial expressions from emotions. We briefly introduce the way this muscle-based 3D face model is created, the


It is an open-sourced, clean and minimalist API among all the bindings available.

In Chapter 4, we give an overview of FreeTTS and Mbrola. FreeTTS is a robust text-to-speech system that we use to get phonemes and timing information from a text. This phoneme string is used to generate lip movements when speaking. FreeTTS supports Mbrola, which is a speech synthesizer based on the concatenation of diphones. We use Mbrola as an output thread of FreeTTS to produce synthetic audible speech. We also present the method to improve the lip-sync capability. The original head can speak, but in some conditions the speech from the speaker is not synchronized with the movements of the lips on the screen. Besides, we may want the head to express various emotions depending on the sentence currently being spoken, so we need to know exactly when the sentence is spoken so that we can generate suitable emotions.

The original head does not have hair or eyelashes models. We supply these parts in order to make it look like a real head and become more attractive. In Chapter 5, we present the method to apply a hair model to the head and the way we draw eyelashes for the eyes. Available hair models can be attached to the head model without much human intervention during the process. In addition, the eyelashes are a small part of the face, but without them the eyes may not look real. The eyelashes also help to improve the emotion expression capability of the eyes when the eyes flutter. We describe some problems in creating the eyelashes and how to fix them to the eyelids so they can move with the eyelids when the eyes close or open.

In Chapter 6, we introduce the implementation of the face using Java and JOGL. We also introduce our improvement to the rendering method of the talking head using the new methods and mechanisms introduced in OpenGL 1.5. This method helps to increase the animation speed significantly. Some illustrations of our 3D talking head model are also given in this chapter.


Chapter 2

The 3D animated talking head

2.1 A muscle based 3D face model

The face model is created from a polygonal face mesh and a B-spline surface for the lips. The face mesh data was initially obtained from a 3D scanner and was processed to improve animation performance while keeping the high quality of the model. The process contains two phases. In the first phase, the number of vertices and polygons was reduced in non-expressive parts but maintained in the expressive parts, which are the areas around the eyes, the nose, the mouth and the forehead. At the end of this phase, the face mesh contains 2,468 vertices and 4,746 polygons. This is small enough for real-time animation but still preserves the high quality of detail in the expressive parts of the face. In the second phase, the face model was divided into eleven regions. Five regions on the left part consist of the left lower face, left middle face, left lower eyelid, left upper eyelid and left upper face. There are five corresponding regions on the right part, and the last region is at the back of the head. This not only helps to prevent unwanted artifacts generated by the displacement of vertices in regions that should not be affected by muscle contractions, but also increases the animation speed.

The lip model is a B-spline surface with a 24 x 6 control point grid. The lips are deformed by moving the control points, and the B-spline surface is polygonalized to connect with the face mesh for rendering. The B-spline surface has the advantage of producing a smooth surface, but it cannot produce wrinkles and needs to be polygonalized before rendering. If the number of control points is too large, it will require heavy computation. Because of these advantages and disadvantages, it is suitable to use a B-spline surface for modeling a small part of the face like the lips. Almost all of the 19 muscles used on the face to generate animation are vector muscles, except Orbicularis Oris, which drives the mouth, and Orbicularis Oculi, which drives the eyes. The vector muscle of the face is an improved version of the vector muscle model from [28]. In addition, a mechanism to generate wrinkles and bulges is added to increase the realism of the facial expressions, and a technique to reduce the computation is also introduced to enhance the animation performance. The Orbicularis Oris muscle is parameterization-based and is adopted from [12]. The Orbicularis Oculi has two parts: the Pars Palpebralis, which opens and closes the eyelid, is adopted from [22], and the Pars Orbitalis, which squeezes the eye, is adopted from [28]. The jaw and eyeball rotation algorithms are improved from the ones proposed in [22]. The mouth now has a natural oval look, and the eyes can track a target. Eye movement is independent of facial muscle movements, and the eyes cannot rotate to impossible positions. All muscles have an intensity range from 0 to 1; the step value between two adjacent muscle contractions is 0.2. This step value was determined after trial-and-error experiments. It is small enough to ensure that the facial animations are smooth and large enough to decrease the computation time. Figure 2.1 shows the original face from [3].

Figure 2.1: The original 3D face model

(a): The face mesh with muscles; (b): The face after rendering
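To make the vector muscle idea concrete, the sketch below shows a highly simplified, Waters-style vector muscle displacement. The classes, names and falloff terms are assumptions for illustration only; they are not the exact formulation used in [3] or [28], where the angular and radial falloff, wrinkles and bulges are handled in much more detail.

```java
// A minimal sketch of a Waters-style vector muscle (hypothetical names and falloff).
// Each muscle has a head (attachment to the skull) and a tail (insertion into the skin);
// vertices inside its zone of influence are pulled toward the head when it contracts.
public class VectorMuscleSketch {

    // Displace one vertex v for a muscle with head h, tail t and a contraction in [0, 1].
    static float[] displace(float[] v, float[] h, float[] t, float contraction) {
        float[] hv = sub(v, h);              // vector from muscle head to the vertex
        float[] ht = sub(t, h);              // muscle axis
        float dist = length(hv);
        float reach = length(ht);
        if (dist == 0 || dist > reach) {
            return v;                        // outside the zone of influence: no effect
        }
        // Angular falloff: vertices close to the muscle axis move more.
        double cosAngle = dot(hv, ht) / (dist * reach);
        if (cosAngle <= 0) {
            return v;
        }
        // Radial falloff: displacement fades toward the end of the muscle's reach.
        double radial = Math.cos((dist / reach) * Math.PI / 2.0);
        double k = contraction * cosAngle * radial;
        // Pull the vertex toward the muscle head by a fraction k of its offset.
        return new float[] {
            (float) (v[0] - k * hv[0]),
            (float) (v[1] - k * hv[1]),
            (float) (v[2] - k * hv[2])
        };
    }

    static float[] sub(float[] a, float[] b) { return new float[] { a[0]-b[0], a[1]-b[1], a[2]-b[2] }; }
    static float dot(float[] a, float[] b)   { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }
    static float length(float[] a)           { return (float) Math.sqrt(dot(a, a)); }
}
```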

2.2 Combination of facial movements on a 3D talking head

The system takes as input marked-up text in which each facial movement (except lip movement while talking) is defined as a group of muscle contractions that share the same function, start time, onset, offset and duration. Lip movement is generated separately inside the system based on the phonemic representation of the input text.


Figure 2.2: System overview

There are several types of facial movements on the face. They include lip movements when talking, conversational signals, emotion displays, gaze and head movements, and manipulators that satisfy the biological requirements of the face. All of them can occur at the same time, and because they are driven by the muscle models, there can be situations with conflicting muscles when two or more movements happen at once. Conflicting muscles are muscles that cannot contract at the same time. For example, when we smile, the Zygomatic Major and Minor muscles contract to pull the corners of the lips outward. If at that moment we concurrently say "Hello", the phoneme "@U" in the word "Hello" requires the contraction of the Orbicularis Oris muscle, which drives the lips into a tight, pursed shape. So Zygomatic Major (and Minor) and Orbicularis Oris are conflicting muscles. The face must solve this problem to produce natural animation.

Each type of facial movement belongs to one channel. There are six channels in the system: manipulators (eye blinking), lip movements (phonemes), conversational signals (muscle contractions), emotion displays (expressions), gaze movements (eye movement) and head movements. The combination process contains two steps. In the first step, the movements in each channel are concatenated to generate smooth transitions between adjacent movements. In the second step, the movements in all channels are combined and processed to resolve conflicting muscles.


Figure 2.4: The activity of Zygomatic Major and Orbicularis Oris before (top) and after (bottom) applying the combination algorithm

Figure 2.3 is an example of combining two movements in the same channel. The muscle's activity in the first movement continues until time 3, when there is a stimulus for the second movement; it then stops following the first movement, releases to the target value of the muscle in the second movement (0.5), and follows the second movement.

Figure 2.4 is an example of combining two movements in different channels. Because Zygomatic Major and Orbicularis Oris are conflicting muscles and the Orbicularis Oris muscle has a higher priority when it is activated (at time 3), the Zygomatic Major is inhibited. However, its activity is adjusted so that it does not release too fast, which would create an unnatural movement. The Zygomatic Major activity releases gradually to zero, and then the Orbicularis Oris muscle starts contracting.
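The sketch below illustrates this priority-based resolution of conflicting muscles. The class, the conflict table and the release rate are hypothetical illustrations of the idea, not the actual data structures used in [3].

```java
import java.util.*;

// A minimal sketch of resolving conflicting muscles between channels (hypothetical names).
// When a higher-priority movement activates a muscle that conflicts with an active one,
// the losing muscle is not cut off abruptly but released gradually toward zero.
public class ConflictResolutionSketch {

    static class MuscleActivity {
        String muscle;
        double level;     // current contraction level in [0, 1]
        int priority;     // e.g. lip movements outrank emotion displays
        MuscleActivity(String muscle, double level, int priority) {
            this.muscle = muscle; this.level = level; this.priority = priority;
        }
    }

    // Pairs of muscles that cannot contract at the same time.
    static final Set<String> CONFLICTS = new HashSet<>(
            Arrays.asList("ZygomaticMajor|OrbicularisOris", "OrbicularisOris|ZygomaticMajor"));

    static boolean conflicts(String a, String b) {
        return CONFLICTS.contains(a + "|" + b);
    }

    // One combination step: inhibit the lower-priority side of each conflict,
    // letting it decay smoothly instead of dropping to zero in a single frame.
    static void combine(List<MuscleActivity> active, double releaseRate) {
        for (MuscleActivity a : active) {
            for (MuscleActivity b : active) {
                if (conflicts(a.muscle, b.muscle) && a.priority < b.priority && b.level > 0) {
                    a.level = Math.max(0.0, a.level - releaseRate);  // gradual release
                }
            }
        }
    }
}
```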

2.3 From emotions to emotional facial expressions

Six emotions are considered to be universal; this means that they are associated consistently with the same facial expressions across different cultures. These emotions are: Happiness, Anger, Surprise, Fear, Disgust and Sadness [34]. Other emotions on the face are considered to be generated by combining the six basic emotions above, but rarely do more than two emotions occur at the same time. So, two aspects of generating emotional facial expressions from emotions have to be addressed. First, depending on the intensity of an emotion, the face must display continuous changes in expression. Second, the face must have a method to combine expressions from two emotions. A fuzzy rule-based system is suitable for these requirements because it allows incorporating qualitative as well as quantitative information.

Figure 2.5: The emotion-to-expression system


There are two fuzzy rule-based systems implemented to convert emotion intensities to muscle contraction levels, which are used to generate emotional expressions on the 3D face model. The first fuzzy rule-based system is used to produce contraction levels from a single emotion intensity; it is called "Single Expression Mode". The second one is used when two emotion intensity values are converted to muscle contraction levels; it is called "Blend Expression Mode". The mechanism to select Single or Blend Expression Mode is based on the intensities of the emotions felt. When a single emotion is expressed, the Single mode is chosen. The Blend Expression Mode is chosen when more than one emotion is expressed, but only the two highest emotion intensity values are used (Figure 2.5).

Figure 2.6: Membership functions for emotion intensity (a) and muscle contraction level (b)


The intensity of each emotion is modeled by five fuzzy sets: VeryLow, Low, Medium, High and VeryHigh. Similarly, the contraction level of each muscle is described by five fuzzy sets: VerySmall, Small, Medium, Big and VeryBig. By using these fuzzy sets, the system can handle qualitative descriptions like "if surprised, then lift the eyebrows" and quantitative descriptions like "if the level of sadness is low, then draw the eyebrows together; while if the level of sadness is high, then draw the eyebrows together and draw the corners of the lips down", etc. The form of the membership functions in Figure 2.6 and the support of each membership function were determined after experiments.

A rule in Single Expression Mode has the following form:

if Sadness is VeryLow then

muscle 9’s contraction level is VerySmall

muscle 13’s contraction level is VerySmall

muscle 14’s contraction level is VerySmall

muscle 15’s contraction level is VerySmall

muscle 18’s contraction level is VerySmall

A rule in Blend Expression Mode has the following form:

if Surprise is Low and Fear is Medium then

muscle 9’s contraction level is Small

muscle 10’s contraction level is Small

muscle 16’s contraction level is Small

muscle 3’s contraction level is Medium

muscle 4’s contraction level is Medium

muscle 5’s contraction level is Medium

muscle 17’s contraction level is Medium

There are no rules to blend the expressions of Happiness and Disgust, or of Sadness and Surprise, because there is no evidence that these emotions can happen concurrently. For these combinations, only the emotion with the higher intensity is expressed.
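A minimal sketch of how such a rule base could be evaluated is given below. The triangular membership functions, the set centers and the defuzzification by weighted average are assumptions for illustration; the thesis only states that the membership functions in Figure 2.6 and their supports were tuned experimentally.

```java
import java.util.*;

// A minimal sketch of fuzzy rule evaluation for Single Expression Mode
// (hypothetical membership functions and rule encoding, for illustration only).
public class FuzzyExpressionSketch {

    // Triangular membership function centered at c with half-width w (assumed shape).
    static double triangular(double x, double c, double w) {
        return Math.max(0.0, 1.0 - Math.abs(x - c) / w);
    }

    // Five fuzzy sets for an emotion intensity in [0, 1]: VeryLow .. VeryHigh.
    static double[] fuzzifyIntensity(double intensity) {
        double[] centers = {0.0, 0.25, 0.5, 0.75, 1.0};
        double[] degrees = new double[centers.length];
        for (int i = 0; i < centers.length; i++) {
            degrees[i] = triangular(intensity, centers[i], 0.25);
        }
        return degrees;
    }

    // ruleOutputs[set][m] gives the output set (0..4 = VerySmall..VeryBig) that the rule
    // for emotion-intensity set 'set' assigns to muscle m, e.g. Sadness/VeryLow -> VerySmall.
    static Map<Integer, Double> contractionLevels(double emotionIntensity,
                                                  int[] muscles, int[][] ruleOutputs) {
        double[] degrees = fuzzifyIntensity(emotionIntensity);
        double[] levelCenters = {0.0, 0.25, 0.5, 0.75, 1.0};   // defuzzified set centers
        Map<Integer, Double> result = new HashMap<>();
        for (int m = 0; m < muscles.length; m++) {
            double num = 0, den = 0;
            for (int set = 0; set < degrees.length; set++) {
                num += degrees[set] * levelCenters[ruleOutputs[set][m]]; // weighted average
                den += degrees[set];
            }
            result.put(muscles[m], den > 0 ? num / den : 0.0);
        }
        return result;
    }
}
```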

Figure 2.7 displays the six basic emotional facial expressions, which are generated from the six corresponding emotions with all intensities set to 1 (the maximum value). The quality of the facial expressions is improved by using psychologically based and fairly simple fuzzy rules rather than other graphics algorithms, complicated formulas or intensively trained Neural Networks.


Figure 2.7: Basic emotions: neutral, Sadness, Happiness, Anger, Fear, Disgust, Surprise (from left to right)

2.4 Conclusion

This 3D face model is suitable for use in an avatar of a virtual meeting room because it can display facial expressions with real-time animation. Besides verbal movements (lip movements when speaking), it can display other non-verbal behaviors such as eye blinking, head rotation, etc. It can also generate emotional facial expressions from emotions and can combine different facial movements displayed at the same time. Not only is the face able to express the six basic built-in emotions, but it can also generate many other emotions by controlling the muscle models. Thus, the participants can express their own emotions and track the emotions of the others through the faces of the avatars. They can benefit from verbal and non-verbal communication and have a new way to find the points of interest in the meeting. One important thing is that the face can help the avatar to bring plausibility to the participants so they can feel that they are in a real meeting with real people.


Chapter 3

OpenGL and JOGL overview

3.1 OpenGL overview

3.1.1 Immediate Mode and Retained Mode (Scene Graphs)

There are two different types of APIs for programming real-time 3D applications [32]. The first type is called retained mode. In retained mode, the description of the objects and the scene is provided to the API, and then the graphics package creates the image on the screen. All the programmer needs to do is issue commands to change the position and viewing orientation of the user (also called the camera) or of other objects in the scene. The structure that has just been built is called a scene graph. The scene graph is a data structure that includes all the objects in our scene and their relationships to each other. Many high-level toolkits or "game engines" use this approach. The programmer does not need to understand how the scene is rendered, because the graphics library takes care of rendering the model or database handed over to it. Java3D is one example of a scene graph API.

The second approach to 3D rendering is called immediate mode. Most retained mode APIs or scene graphs use an immediate mode API internally to actually perform the rendering. For example, Java3D uses OpenGL or Direct3D to render the geometry created by the user. In immediate mode, the programmer does not describe the models and the environment at as high a level as in retained mode. Instead, they issue commands directly to the graphics processor. Each command has an immediate effect depending on the current state settings, and new commands have no effect on rendering commands that have already been executed. This allows everything to be controlled at a low level.

3.1.2 OpenGL history

OpenGL is an industry-standard, cross-platform Application Programming Interface (API). The specification for this API was finalized in 1992, and the first implementations appeared in 1993. The forerunner of OpenGL is Iris GL (Graphics Library), the API that was designed and supported by Silicon Graphics, Inc. To establish an industry standard, Silicon Graphics collaborated with various graphics hardware companies to create an open standard, which was named "OpenGL".

Until now, seven revisions have been introduced to add new functionality to the API. The newest version of the OpenGL specification is 2.1. All newer versions are upward compatible with earlier versions [4].

- Version 1.1 was finished in 1997 and added support for two important capabilities: vertex arrays and texture objects

- The specification for OpenGL 1.2 was released in 1998 and added support for 3D textures and an optional set of imaging functionality

- The OpenGL 1.3 specification was completed in 2001 and added support for cube map textures, compressed textures, multi-textures, etc

- OpenGL 1.4 was completed in 2002 and added automatic mipmap generation, additional blending functions, internal texture formats for storing depth values for use in shadow computations, support for drawing multiple vertex arrays with a single command, more control over point rasterization, control over stencil wrapping behavior, and various additions to texturing capabilities

- The OpenGL 1.5 specification was published in October 2003 It added support for vertex buffer objects, shadow comparison functions and occlusion queries

- OpenGL 2.0, finalized in September 2004, opened up the processing pipeline for user control by providing programmability for both vertex processing and fragment processing Other features added in 2.0 include support for multiple render targets, nonpower-of-2 textures, point sprites, and separate stencil functionality for front- and back-facing surfaces

- Version 2.1, released in August 2006, added support for revision 1.20 of the OpenGL Shading Language, non-square matrices, pixel buffer objects and sRGB textures

3.1.3 How does OpenGL work?

OpenGL implementations can be software implementations or hardware implementations. Windows applications can call a Windows API called the Graphics Device Interface (GDI) to create output onscreen, and graphics card vendors usually supply a driver for GDI to interface with. A software implementation of OpenGL takes graphics requests from an application and constructs (rasterizes) a color image of the 3D graphics. This image is then supplied to the GDI to be displayed on the monitor. Microsoft has its own OpenGL software implementation, and almost all modern operating system products from Microsoft contain support for OpenGL. However, SGI and MESA also released software implementations of OpenGL for Windows that greatly outperformed Microsoft's implementation.

Figure 3.1: Software implementation of OpenGL

An OpenGL hardware implementation usually takes the form of a graphics card driver. OpenGL API calls from applications are passed to a hardware driver. This driver does not pass its output to the Windows GDI for display; instead, it interfaces directly with the graphics display hardware. The more components of OpenGL are implemented in hardware, the faster the implementation processes the calls from applications and displays images onscreen.

Figure 3.2: Hardware implementation of OpenGL


When an application calls OpenGL API functions, the commands are placed in a command buffer. Vertex data, texture data, etc. are also contained in this buffer. When the buffer is flushed, the commands and data are passed to the "Transformation and Lighting" step. In this step, the points used to describe an object's geometry are recalculated to determine the given object's location and orientation. Lighting calculations are performed as well to indicate the brightness of the colors at each vertex. When this stage is finished, the data is passed to the "Rasterization" step of the pipeline. The rasterizer actually creates the color image from the geometric, color and texture data and places the image into the frame buffer. The frame buffer is the memory area of the graphics display device, which means the image is displayed on the screen. Figure 3.3 shows a simplified view of the OpenGL pipeline. At a lower level, there are many boxes inside each box of the diagram.

Figure 3.3: A simplified version of OpenGL pipeline

3.1.4 OpenGL as a state machine

OpenGL is designed as a state machine [21]. If we put it into specific states (or modes), these states remain in effect until we change them. For example, the current color is a state variable. We can set the current color to black, white, red, or any other color, and all objects will be drawn with that color until we set the current color to something else. The current color is only one of many state variables that OpenGL maintains. Other states include the current viewing and projection transformations, line and polygon stipple patterns, polygon drawing modes, pixel-packing conventions, positions and characteristics of lights, and material properties of the objects being drawn.

The execution model for OpenGL can be described as client-server. An application (the client) issues OpenGL commands that are interpreted and processed by an OpenGL implementation (the server). Many server-side variables have only two states, on or off, and are enabled or disabled with the commands glEnable() or glDisable(). For client-side state, we enable it with glEnableClientState() and disable it with glDisableClientState(). Each state variable or mode has a default value, and we can query the system for each variable's current value at any time. In addition, we can save a collection of server-side state variables on an attribute stack with glPushAttrib(), and client-side state can be pushed on a second stack with glPushClientAttrib(). We can temporarily modify the states and restore the values later with glPopAttrib() or glPopClientAttrib() for server-side or client-side states, respectively. In the case where we only need to change the state temporarily, using these commands is likely to be more efficient than issuing the query commands.
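As a small illustration of this state-machine style through JOGL, the fragment below enables lighting, temporarily changes the current color, and restores it from the attribute stack. It assumes the JOGL 1.x javax.media.opengl.GL object that is handed to a GLEventListener (see Section 3.2.3); the surrounding drawing code is omitted.

```java
import javax.media.opengl.GL;

// A minimal sketch of using OpenGL as a state machine from JOGL 1.x.
public class StateMachineSketch {

    static void drawWithTemporaryColor(GL gl) {
        gl.glEnable(GL.GL_LIGHTING);            // server-side state: on until disabled

        gl.glPushAttrib(GL.GL_CURRENT_BIT);     // save the current color (and related state)
        gl.glColor3f(1.0f, 0.0f, 0.0f);         // everything drawn from now on is red
        // ... issue drawing commands here ...
        gl.glPopAttrib();                       // restore the previous current color

        gl.glDisable(GL.GL_LIGHTING);
    }
}
```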

3.1.5 Drawing geometry

All graphic objects in OpenGL are constructed from geometric drawing primitives. OpenGL only supports the following geometric primitives: points, lines, line strips, line loops, polygons, triangles, triangle strips, triangle fans, quadrilaterals, and quadrilateral strips. There are three main ways to send geometry data to OpenGL for rendering [25]. The first is the vertex-at-a-time method. The command glBegin() is called to start a primitive, and then glEnd() to end it. Between these two commands are commands that specify vertex attributes such as vertex position, color, normal and texture coordinates: glVertex*(), glColor*(), glNormal*(), glTexCoord*(), etc. When the vertex-at-a-time method is used, the call to glVertex*() signals the end of the data definition for a single vertex, and it may also define the completion of a primitive. After calling glBegin() and specifying a primitive type, a graphics primitive is completed by calling glVertex*() enough times to completely specify a primitive of the indicated type. For example, a triangle is completed every third time glVertex*() is called.
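For instance, the JOGL fragment below draws one colored triangle with the vertex-at-a-time method. It is a sketch only, and it assumes the JOGL 1.x GL object obtained inside a GLEventListener callback.

```java
import javax.media.opengl.GL;

// A minimal vertex-at-a-time (glBegin/glEnd) sketch in JOGL 1.x.
public class ImmediateModeSketch {

    static void drawTriangle(GL gl) {
        gl.glBegin(GL.GL_TRIANGLES);            // start a triangle primitive
        gl.glColor3f(1.0f, 0.0f, 0.0f);
        gl.glVertex3f(-1.0f, -1.0f, 0.0f);
        gl.glColor3f(0.0f, 1.0f, 0.0f);
        gl.glVertex3f(1.0f, -1.0f, 0.0f);
        gl.glColor3f(0.0f, 0.0f, 1.0f);
        gl.glVertex3f(0.0f, 1.0f, 0.0f);        // the third glVertex* call completes the triangle
        gl.glEnd();                             // end the primitive
    }
}
```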

The second method to draw primitives is to use vertex arrays. With this method, vertex attributes are stored in user-defined arrays; the application then sets up pointers to the arrays and uses glDrawArrays(), glMultiDrawArrays(), or glInterleavedArrays(), etc. to draw a huge number of primitives at once. Because this method can efficiently pass large amounts of geometry data to OpenGL, it is usually used for portions of code that are extremely performance-critical. Using glBegin() and glEnd(), application developers have to specify each attribute of each vertex, so the number of function calls can become significant when objects with thousands of vertices are drawn. In contrast, with the vertex array method we can draw a large number of primitives with a single function call after the vertex data has been organized into arrays. Besides, this method can be faster than the vertex-at-a-time method because it is often more efficient for the OpenGL implementation to deal with data organized into arrays. OpenGL supports several types of arrays, including color arrays, vertex position arrays and normal vector arrays. The current arrays are specified with glColorPointer(), glVertexPointer() and glNormalPointer(), respectively. We have to indicate which types of arrays will be used before calling glDrawArrays() or glMultiDrawArrays(). The function glInterleavedArrays() can specify and enable several interleaved arrays simultaneously (e.g., each vertex might be defined with three floating-point values representing a normal followed by three floating-point values representing a vertex position).
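The following sketch draws the same triangle with a vertex array through JOGL 1.x; the direct FloatBuffer and the call sequence are one plausible arrangement, not the code used in the thesis implementation.

```java
import java.nio.FloatBuffer;
import javax.media.opengl.GL;
import com.sun.opengl.util.BufferUtil;

// A minimal vertex array sketch in JOGL 1.x.
public class VertexArraySketch {

    static void drawTriangle(GL gl) {
        // Three vertices, three floats each, in a direct buffer as JOGL expects.
        FloatBuffer vertices = BufferUtil.newFloatBuffer(9);
        vertices.put(new float[] {
            -1.0f, -1.0f, 0.0f,
             1.0f, -1.0f, 0.0f,
             0.0f,  1.0f, 0.0f
        });
        vertices.rewind();

        gl.glEnableClientState(GL.GL_VERTEX_ARRAY);   // enable client-side vertex array state
        gl.glVertexPointer(3, GL.GL_FLOAT, 0, vertices);
        gl.glDrawArrays(GL.GL_TRIANGLES, 0, 3);       // one call draws all the vertices
        gl.glDisableClientState(GL.GL_VERTEX_ARRAY);
    }
}
```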

The two former methods are called immediate mode because primitives are rendered right after they have been specified. In the third method, all function calls are stored in a display list and are pre-processed before executing. A display list is an OpenGL-managed data structure that stores commands for later execution. Both commands to set state and commands to draw geometry can be included in a display list, and they are stored on the server side. A display list can be processed later with glCallList() or glCallLists(). The display list is initiated with glNewList() and completed with glEndList(). All the commands issued between those two calls become part of the display list. There are, however, certain OpenGL commands that are not allowed within display lists. In general, display list mode can provide better performance than immediate mode. The OpenGL implementation can optimize the commands in the display list for the underlying hardware and store them in a memory area that allows better drawing performance, such as the memory of the graphics accelerator. These optimizations require some extra computation or data movement, so applications only see a performance benefit if the display list is called more than once.
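A small JOGL sketch of compiling and replaying a display list is shown below; the structure (building the list once and calling it per frame) is an assumed but typical arrangement.

```java
import javax.media.opengl.GL;

// A minimal display list sketch in JOGL 1.x.
public class DisplayListSketch {

    private int listId;

    // Compile the drawing commands once (e.g. from GLEventListener.init()).
    void build(GL gl) {
        listId = gl.glGenLists(1);
        gl.glNewList(listId, GL.GL_COMPILE);    // record the commands, do not execute yet
        gl.glBegin(GL.GL_TRIANGLES);
        gl.glVertex3f(-1.0f, -1.0f, 0.0f);
        gl.glVertex3f( 1.0f, -1.0f, 0.0f);
        gl.glVertex3f( 0.0f,  1.0f, 0.0f);
        gl.glEnd();
        gl.glEndList();
    }

    // Replay the recorded commands every frame (e.g. from GLEventListener.display()).
    void draw(GL gl) {
        gl.glCallList(listId);
    }
}
```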

From OpenGL version 1.5, there is a mechanism that permits vertex array data to be stored in server-side memory. This mechanism typically provides the highest rendering performance because the data can be stored in the memory of the graphics accelerator and need not be transferred over the I/O bus each time it is rendered. The glBindBuffer() command creates a buffer object in the memory of the graphics accelerator, and the glBufferData() and glBufferSubData() commands are used to specify the data values for that buffer. The API also supports efficiently streaming data from client to server: glMapBuffer() can map a buffer object into the client's address space and obtain a pointer to this memory so that we can specify data values directly. Before using other rendering commands that access the buffer, we need to call glUnmapBuffer() to remove the current pointer to that buffer object.
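Since the improved rendering method of our talking head relies on these OpenGL 1.5 buffer objects (see Chapter 6), the sketch below shows one plausible way to create and draw a vertex buffer object from JOGL 1.x. The usage hint, the buffer layout and the method names on the Java side are illustrative assumptions, not the thesis code.

```java
import java.nio.FloatBuffer;
import javax.media.opengl.GL;
import com.sun.opengl.util.BufferUtil;

// A minimal vertex buffer object (OpenGL 1.5) sketch in JOGL 1.x.
public class VertexBufferSketch {

    private int[] bufferId = new int[1];

    // Upload the vertex data into server-side memory once.
    void build(GL gl, float[] vertexData) {
        FloatBuffer data = BufferUtil.newFloatBuffer(vertexData.length);
        data.put(vertexData);
        data.rewind();

        gl.glGenBuffers(1, bufferId, 0);
        gl.glBindBuffer(GL.GL_ARRAY_BUFFER, bufferId[0]);
        gl.glBufferData(GL.GL_ARRAY_BUFFER, vertexData.length * BufferUtil.SIZEOF_FLOAT,
                        data, GL.GL_STATIC_DRAW);        // data that rarely changes
    }

    // Draw from the buffer object instead of a client-side array.
    void draw(GL gl, int vertexCount) {
        gl.glBindBuffer(GL.GL_ARRAY_BUFFER, bufferId[0]);
        gl.glEnableClientState(GL.GL_VERTEX_ARRAY);
        gl.glVertexPointer(3, GL.GL_FLOAT, 0, 0L);       // offset into the bound buffer
        gl.glDrawArrays(GL.GL_TRIANGLES, 0, vertexCount);
        gl.glDisableClientState(GL.GL_VERTEX_ARRAY);
        gl.glBindBuffer(GL.GL_ARRAY_BUFFER, 0);
    }
}
```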

3.2 JOGL overview

3.2.1 Introduction

OpenGL is made for graphics and it is fast. On almost all modern graphics cards it is hardware accelerated. We can use OpenGL to create almost anything we want visually. Unfortunately, OpenGL is written in the C language. Besides, we need to put the graphics produced by OpenGL into a window to display them, but OpenGL itself does not have any commands for creating windows. This makes OpenGL hard to learn for beginners or for programmers who want to use a true Object Oriented Programming (OOP) language like Java. Java is possibly the most popular true OOP language. There have been many attempts to combine OpenGL with Java and provide access to OpenGL through a friendly Java API, such as Java 3D, OpenGL for Java Technology (gl4java) and the Lightweight Java Game Library (LWJGL), but the most robust, simple and easy-to-use API is JOGL. The reason is that JOGL is supported by both Sun (the creators of Java) and SGI (the creators of OpenGL). JOGL is a Java programming language binding for the OpenGL 3D graphics API. It supports integration with the Java platform's AWT and Swing widget sets while providing a minimal and easy-to-use API that handles many of the issues associated with building multithreaded OpenGL applications. JOGL provides access to the latest OpenGL routines (OpenGL 2.0 with vendor extensions) as well as platform-independent access to hardware-accelerated off-screen rendering. JOGL also provides some of the most popular features introduced by other Java bindings for OpenGL like GL4Java, LWJGL and Magician, including a composable pipeline model which can provide faster debugging for Java-based OpenGL applications than the analogous C program. JOGL differs from these libraries in that it merely exposes the procedural OpenGL API via methods on a few classes, rather than attempting to map OpenGL functionality onto the OOP paradigm [9].


The JOGL binding is itself written almost completely in the Java programming language. Indeed, the majority of the JOGL code is auto-generated from the OpenGL C header files via a conversion tool named GlueGen, which was programmed specifically to facilitate the creation of JOGL. GlueGen parses the C header files and then automatically creates the Java and JNI code necessary to connect to those native libraries. This design decision has both advantages and disadvantages. The procedural and state machine nature of OpenGL is inconsistent with the typical method of programming under Java, which is bothersome to many programmers. However, the straightforward mapping of the OpenGL C API to Java methods makes conversion of existing C applications and example code much simpler. The thin layer of abstraction provided by JOGL makes runtime execution quite efficient. Because most of the code is auto-generated, all updates to OpenGL can be added quickly to JOGL [30].

3.2.2 Developing with JOGL

JOGL was designed for the most recent versions of the Java platform and, for this reason, it supports only J2SE 1.4 and later. It also only supports true color (15 bits per pixel and higher) rendering and does not support color-indexed modes. It was designed with New I/O (NIO) in mind and uses NIO internally in the implementation.

To develop an application using JOGL, we need both jogl.jar and the appropriate native library jar file (for example, jogl-natives-win32.jar). The jogl.jar needs to be on the CLASSPATH for compiling and running code, while the native library file or files also need to be on the java.library.path at run time. We can include the files with our code and point to them directly with the -classpath and -Djava.library.path arguments. This approach helps end users who may not want, or may not be able, to add files to these directories.
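For example, a typical launch on Windows might look like the following; the application class name TalkingHeadApp and the lib directory are illustrative assumptions, while jogl.jar, -classpath and -Djava.library.path come from the text above.

```
javac -classpath jogl.jar TalkingHeadApp.java
java  -classpath jogl.jar;. -Djava.library.path=lib TalkingHeadApp
```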

The recommended distribution vehicle for applications using JOGL is Java Web Start. JOGL-based applications do not even need to be signed; all that is necessary is to reference the JOGL extension JNLP file. Because the JOGL jar files are signed, an unsigned application can reference the signed JOGL library and continue to run inside the sandbox. The users only need to launch Java Web Start and download the client application the first time; the application is then cached on the client machine and can subsequently be launched again, even offline.


JOGL also supports applets. The JOGLAppletInstaller is distributed inside jogl.jar as a utility class in com.sun.opengl.util. This installer uses some clever tricks to allow deployment of unsigned applets which use JOGL into existing web browsers and JREs as far back as 1.4.2, which is the earliest version of Java supported by JOGL. It requires that the developer host a local, signed copy of jogl.jar and all of the jogl-natives jar files; the certificates must be the same on all of these jars. Because all of these jar files in the release builds of JOGL are signed by Sun Microsystems, the developer can deploy applets without needing any certificates.

3.2.3 Using JOGL

JOGL provides two basic widgets into which OpenGL rendering can be performed. The GLCanvas is a heavyweight AWT widget which supports hardware acceleration and which is intended to be the primary widget used by applications. The GLJPanel is a fully Swing-compatible lightweight widget which also supports hardware acceleration, but it is not as fast as the GLCanvas because it typically reads back the frame buffer in order to draw it using Java2D. The GLJPanel is intended to provide 100% correct Swing integration in circumstances where a GLCanvas cannot be used.

Both the GLCanvas and the GLJPanel implement a common interface called GLAutoDrawable, so applications can switch between them with minimal code changes. The GLAutoDrawable interface provides:

- access to the GL object for calling OpenGL routines

- a callback mechanism (GLEventListener) for performing OpenGL rendering

- a display() method for forcing OpenGL rendering to be performed synchronously

- AWT- and Swing-independent abstractions for getting and setting the size of the widget and adding and removing event listeners

Applications implement the GLEventListener interface to perform OpenGL drawing via callbacks. When the methods of the GLEventListener are called, the underlying OpenGL context associated with the drawable is already current. The listener fetches the GL object out of the GLAutoDrawable and begins to perform rendering.

The init() method is called when a new OpenGL context is created for the given GLAutoDrawable. Any display lists or textures used during the application's normal rendering loop can be safely initialized in init(). The display() method is called to perform per-frame rendering. The reshape() method is called when the drawable has been resized; the default implementation automatically resizes the OpenGL viewport, so often it is not necessary to do any work in this method. The displayChanged() method is designed to allow applications to support on-the-fly screen mode switching, but it is not yet implemented, so the body of this method should remain empty.
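Putting these pieces together, a minimal JOGL 1.x application might look like the sketch below; the class and window names are illustrative and are not taken from the thesis implementation.

```java
import java.awt.Frame;
import javax.media.opengl.GL;
import javax.media.opengl.GLAutoDrawable;
import javax.media.opengl.GLCanvas;
import javax.media.opengl.GLEventListener;

// A minimal JOGL 1.x application: a GLCanvas in an AWT Frame with a GLEventListener.
public class MinimalJoglApp implements GLEventListener {

    public static void main(String[] args) {
        GLCanvas canvas = new GLCanvas();
        canvas.addGLEventListener(new MinimalJoglApp());

        Frame frame = new Frame("Minimal JOGL application");
        frame.add(canvas);
        frame.setSize(640, 480);
        frame.setVisible(true);
    }

    // Called once when the OpenGL context is created: set up state, display lists, textures.
    public void init(GLAutoDrawable drawable) {
        GL gl = drawable.getGL();
        gl.glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
    }

    // Called for every frame: fetch the GL object and draw.
    public void display(GLAutoDrawable drawable) {
        GL gl = drawable.getGL();
        gl.glClear(GL.GL_COLOR_BUFFER_BIT | GL.GL_DEPTH_BUFFER_BIT);
        // ... issue drawing commands for the current frame here ...
    }

    // Called when the drawable is resized; the viewport is already adjusted by default.
    public void reshape(GLAutoDrawable drawable, int x, int y, int width, int height) {
    }

    // Not yet implemented by JOGL; the body should remain empty.
    public void displayChanged(GLAutoDrawable drawable, boolean modeChanged, boolean deviceChanged) {
    }
}
```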

Figure 3.4: The structure of an application using JOGL

3.3 Conclusion

In this chapter, we have introduced OpenGL and JOGL. By using JOGL, we can create OpenGL applications with the Java programming language. Because JOGL exposes almost all OpenGL API commands in a few classes and is generated automatically from the C header files of OpenGL, all programmers who are familiar with OpenGL can use JOGL without any difficulties.


is required. Humans can detect very slight misalignments, so any asynchronism between the voice and the lip movements may affect the users' confidence in the system. Second, the effects of co-articulation need to be addressed. Co-articulation is the blending effect that surrounding phonemes have on the current phoneme.

There are several ways to synchronize lip movements with speech; the classification mainly depends on the type of speech data. First, the text-driven approach uses a text as input. The phonemic representation of this text is used to generate both synthetic audible and visible speech. The second way, the speech-driven method, takes pre-recorded speech as input. The phonemes and timing information are obtained by analyzing the audio data file and are used to create the lip movements. The audio file is then played synchronously with the facial animations. If both text and speech audio are available, the text-and-speech-driven hybrid approach can be applied. The phonemes and timing information obtained from the text are adjusted to synchronize with the audio file before being used by the animation components [1].

The phonemic representation of a text is required to create movements for the lips while speaking that text. There are two main ways to get the phoneme and timing information from a text. First, we can use a phoneme database and search for the phonemic representation of each word in the input text; the timing information is obtained by separate algorithms. In the second way, we can use some rules to analyze the text to get both the representation and the timing information. Of course, the latter method is fast, but it is not as accurate as the former. Besides, the algorithms used to process the text in the first method have been improved, and the processing power of PCs is now fast enough, so the second method is rarely used today.


4.2 Previous work

The original head uses the text-driven approach to get phonemes and timing information. The phonemic representation of the given text, including the timing information, is used to generate lip movements. The phonemes in the phoneme string are taken as parameters to search for the corresponding visemes. Each viseme belongs to a viseme segment which has a set of parameters for the dominance functions. These dominance functions participate in the articulation of a speech segment (lip movement). Finally, the viseme segments and the timing information of the corresponding phonemes are used to generate the key frames of the lip movement. The other frames of this lip movement are generated by interpolating from the key frames of the current movement and the adjacent movements.

The original head model used the dominance model from (Cohen and Massaro, 1993) to create the co-articulation effect of the lip movements. Each viseme has a dominance over the vocal articulators that increases and decreases over time during articulation. This dominance function determines how close the lips come to reaching the target value of the viseme. Each movement has a set of dominance functions, one for each parameter. These dominance functions are based on (DeCarlo et al., 2002).
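For illustration, a Cohen-Massaro style dominance is commonly written as a negative exponential of the time distance from the viseme's center, and the value of each lip parameter is a dominance-weighted average of the targets of the overlapping visemes. The sketch below, with assumed parameter names and shared constants, is a simplification of the model actually used in the head.

```java
// A simplified sketch of Cohen-Massaro style dominance blending (assumed names and values).
public class DominanceSketch {

    // Dominance of a viseme at time t: alpha * exp(-theta * |t - center|^c).
    static double dominance(double t, double center, double alpha, double theta, double c) {
        return alpha * Math.exp(-theta * Math.pow(Math.abs(t - center), c));
    }

    // Blend one lip parameter (e.g. jaw opening) over all active visemes:
    // a dominance-weighted average of each viseme's target value for that parameter.
    static double blend(double t, double[] centers, double[] targets,
                        double alpha, double theta, double c) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < centers.length; i++) {
            double d = dominance(t, centers[i], alpha, theta, c);
            num += d * targets[i];
            den += d;
        }
        return den > 0 ? num / den : 0.0;   // co-articulation emerges from the overlap
    }
}
```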

The co-articulation part of the lip animation generation process works well, but the audible and visible synchronization part has some disadvantages. First, there is no mechanism to ensure that audible and visible speech start at exactly the same time. Besides, generating lip movements from the beginning to the end of the phoneme string may cause a problem: in some situations it may lead to a misalignment between the lip movements and the sound from the speaker, and as mentioned before, humans can detect a very slight misalignment, even one of only 5 ms. Second, because the head can combine and display several facial movements at once, it can show emotions while talking. The original head takes as input marked-up text which specifies the text to speak and groups of emotions to display while talking. Each emotion group has the start time, duration, onset and offset values and the intensity values of the six basic emotions. An emotion in a group has an intensity value chosen so that it does not conflict with the others. The head can speak the given text while displaying, in order, the groups of emotions with the corresponding intensity values, starting at each group's start time and lasting for the group's duration. In practice, however, we usually want to generate emotions that correspond to specific sentences in the given text rather than depending on estimated times. The original marked-up text looks like this:

To synthesize speech, FreeTTS breaks the input text into sets of phonemes and then converts those phonemes into audible speech. FreeTTS does this by performing successive operations on the input text. FreeTTS stores the cumulative results of each operation in an utterance structure that holds the complete analysis of the text. Figure 4.1 shows the overall architecture of FreeTTS. The core of FreeTTS is an engine that contains a voice and an output thread. The voice consists of a set of utterance processors that create, process, and annotate an utterance structure. Associated with the voice is a data set that is used by each of the utterance processors. The output thread is responsible for two actions: synthesizing an utterance into audio data and then directing this data to the appropriate audio player device.


Figure 4.1: FreeTTS Architecture
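As a point of reference, driving FreeTTS from Java is roughly as simple as the sketch below. The voice name is an assumption (the stock "kevin16" diphone voice), and extracting the phoneme and timing information needed for the lips requires going deeper into the utterance structure than shown here.

```java
import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

// A minimal sketch of speaking a sentence with FreeTTS (voice name is an assumption).
public class FreeTtsSketch {

    public static void main(String[] args) {
        VoiceManager voiceManager = VoiceManager.getInstance();
        Voice voice = voiceManager.getVoice("kevin16");   // a stock FreeTTS diphone voice
        if (voice == null) {
            System.err.println("Voice not found; check the FreeTTS voice jars on the classpath.");
            return;
        }
        voice.allocate();                                  // load the lexicon and unit database
        voice.speak("Hello from the talking head.");       // runs the utterance processors
        voice.deallocate();
    }
}
```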

The heart of FreeTTS lies in its voice and utterance structures. The voice maintains the global information about the synthesis process: the locale, the pronunciation lexicon, the unit database, and the wave synthesizer. The voice also maintains the set of utterance processors used to create and annotate the utterance structure. The utterance structure is a temporary object the voice creates for each audio wave it generates. The voice initializes the utterance structure with the input text and then passes the utterance structure to a set of utterance processors in sequence. Once the input text is processed (e.g., sent to an audio output device), the voice discards the utterance structure. Each utterance processor adds additional items to the utterance structure in a hierarchical and relational manner. For example, one utterance processor creates a relation in the utterance structure consisting of items holding the words of the input text. Another utterance processor creates a relation that consists of items describing the syllables of the words, with each


References

[1] Albrecht, I., Haber, J., Seidel, H.-P., Speech Synchronization for Physics-based Facial Animation, in Proceedings of WSCG 2002.
[2] Bui, H.T., Rajman, M., Melichar, M., Rapid Dialogue Prototyping Methodology, in Proceedings of the 7th International Conference on Text, Speech and Dialogue.
[3] Bui, T.D., Creating Emotions and Facial Expressions for Embodied Agents, PhD thesis, University of Twente, 2004.
[4] Buss, S.R., 3D Computer Graphics: A Mathematical Introduction with OpenGL, Cambridge University Press, 2003, chapter VIII, pages 200-229.
[6] Ekman, P., Friesen, W.V., Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues, Prentice-Hall, Englewood Cliffs, New Jersey, 1975.
[7] Gachery, S., Magnenat-Thalmann, N., Designing MPEG-4 Facial Animation Tables for Web Applications, in Proceedings of Multimedia Modeling, 2001.
[9] Java Bindings for OpenGL (JOGL) project, JOGL - User's Guide, https://jogl.dev.java.net/nonav/source/browse/*checkout*/jogl/doc/userguide/index.html?rev=HEAD&content-type=text/html.
[10] Kahler, K., Haber, J., Seidel, H.-P., Geometry-based Muscle Modeling for Facial Animation, in Proceedings of Graphics Interface, 2001.
[11] Kahler, K., Haber, J., Yamauchi, H., Seidel, H.-P., Generating Animated Head Models with Anatomical Structure, in Proceedings of the 2002 ACM SIGGRAPH Symposium on Computer Animation.
[12] King, S.A., Parent, R.E., Olsafsky, B., An Anatomically-based 3D Parametric Lip Model to Support Facial Animation and Synchronized Speech, in Proceedings of Deform 2000, pages 7-19, 2000.
[13] Koster, M., Haber, J., Seidel, H.-P., Real-Time Rendering of Human Hair Using Programmable Graphics Hardware, in Proceedings of Computer Graphics International 2004 (CGI 2004).
[14] Kshirsagar, S., Thalmann, N.M., Lip Synchronization Using Linear Predictive Analysis, IEEE International Conference on Multimedia and Expo (II), 2000.
[15] Mani, M.V., Ostermann, J., Cloning of MPEG-4 Face Models, International Workshop on Very Low Bitrate Video Coding (VLBV 01), 2001.
[16] Maszák, Z.S., Introduction to VRML 2.0, http://c3.hu/cryptogram/vrmltut/.
[17] Nijholt, A., van Welbergen, H., Zwiers, J., Introducing an Embodied Virtual Presenter Agent in a Virtual Meeting Room, in Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (AIA 2005).
[18] Nijholt, A., Zwiers, J., Peciva, J., The Distributed Virtual Meeting Room Exercise, in Proceedings of the ICMI Workshop on Multimodal Multiparty Meeting Processing, 2005.
[19] Noh, J., Neumann, U., Expression Cloning, in Proceedings of CGI 2001.
[20] Obitko, M., Introduction to Genetic Algorithms, 1998, http://cs.felk.cvut.cz/~xobitko/ga/.
[21] OpenGL Architecture Review Board, Shreiner, D., Woo, M., Neider, J., OpenGL Programming Guide, Addison-Wesley Publishing Company, 1997.
[22] Parke, F.I., A Parametric Model for Human Faces, PhD thesis, University of Utah, 1974.
