Abstract: In this paper we present the development of an expressive facial animation system. There is a substantial body of work on face modeling, including the multi-layered approach, iFACE, and voice-driven facial animation, and we survey these approaches alongside our own. We illustrate the effectiveness of the system with two applications: (a) a text-to-speech synthesizer with expression control and (b) a conversational agent that can react to simple phrases. We also present a methodology for specifying facial animation based on a multi-layered approach, and the iFACE system, a visual speech synthesizer that provides a form of virtual face-to-face communication.
Keywords: facial animation, text-to-visual speech synthesis, expressive visual speech, iFACE.
1. Introduction
Facial animation is one of the most important aspects of a realistic virtual actor. Expressions convey important information about the emotional state of a person and about the content of speech, making the face the most complex and effective tool of human communication. The study of facial motion has been an active area of research in many disciplines, such as computer graphics, linguistics, and psychology. Building an animatable, moderately sophisticated human face for research or entertainment purposes requires significant engineering effort. Researchers painstakingly create custom face models because existing ones are either not available, or not suitable for their purposes. To the best of our knowledge, there is currently no easy way to develop a relatively high quality facial animation system without significant effort.
In this paper we propose a facial animation system that is built from robust, off-the-shelf components. Its modular structure makes the system extensible, intuitive and easy to replicate. We demonstrate the versatility and ease of use of the proposed system by implementing two applications.
(a) A visual speech synthesis module: given an input string of text annotated with emotion tags, the proposed system can produce the corresponding lip-synchronized motion of a high quality 3D face model with appropriate expressions. Synthesizing visual speech is challenging because of the coarticulation effect: the shape of the mouth that corresponds to a phoneme in a speech sequence depends not only on the current phoneme, but also on phonemes that occur before or after the current phoneme.
(b) A conversational agent: this application implements an autonomous agent that can understand simple phrases and react to them. Figure 1 shows an overview of the proposed system. Given input text annotated with expression tags, our text-to-speech (Speech API) module produces the corresponding speech signal along with a phoneme sequence. The phoneme sequence goes through the blendshape mapping module, which produces facial motion that matches the input text and expression tags. The resulting facial motion and corresponding phoneme sequence are fed to the coarticulation processor, which applies the appropriate rules and produces new sequences. The resulting facial motion is smoothly applied to the facial model by the animation system. Finally, the audio and the motion are composited and displayed on the screen. The text can come from any source. Lastly, the system is modular, allowing developers or users to replace outdated components with more technologically advanced (or more expensive) modules.
2. Previous Work
There are mainly two ways to model faces. The first class of models, first introduced by [Parke, 1974], is based on 3D polygonal representations. Such models can be constructed by hand, by 3D scanning of real faces, or using image-based rendering techniques [Pighin et al., 1998; Guenter et al., 1998; Blanz and Vetter, 1999]. The muscle-based approach [Lee et al., 1995; Waters, 1987] models the underlying muscles and skin of the face. Polygonal face models need to be accompanied by high quality texture maps, which can be extracted by scanners [Lee et al., 1995] and photographs [Pighin et al., 1998; Guenter et al., 1998]. Image-based rendering techniques provide an alternative to 3D face models. Using recorded video sequences of human subjects, such techniques can directly produce novel video sequences of animated faces [Brand, 1999; Ezzat et al., 2002; Saisan et al., 2004]. Independently of the underlying face model, the main problem is the synthesis and control of facial motion.
2.1 Face motion synthesis
A facial motion model typically consists of three major motion classes: lower face motion (lip and chin), upper face motion (eyes and eyebrows), and rigid head motion. In this work, we focus on lip synchronization that accurately mimics lip motion matching an input audio sentence. This problem is challenging because of the need to accurately model human coarticulation: the shape of the mouth corresponding to a phoneme depends on the phonemes before and after the given phoneme. Approaches that attempt to solve this problem can be categorized in three ways. The physics-based approach uses the laws of physics and muscle forces to drive the motion of the face. Although it is computationally expensive, it has been shown to be quite effective [Lee et al., 1995; Waters, 1987]. Data-driven approaches use a phoneme-segmented input speech signal to search within large databases of recorded motion and audio data for the closest matches [Massaro, 1997; Bregler et al., 1997; Cao et al., 2003; Cao et al., 2005]. Within this family of techniques, a similar approach to our system was proposed in [Albrecht et al., 2002], but their work focused on delivering expressive emotions along with the speech. A third class of techniques attempts to eliminate the need for large example databases by creating statistical or dynamic models of face motion [Brook and Scott, 1994; Masuko et al., 1998; Cohen and Massaro, 1993; Brand, 1999; Ezzat et al., 2002; Saisan et al., 2004; Kalberer, 2003; Kalberer et al., 2002]. Most of these approaches use high quality motion or video data to learn compact statistical models of facial motion. These models produce the corresponding facial motion essentially by performing a statistical interpolation of the recorded data. There is a lot of work on facial animation within the area of embodied conversational agents, much of which implements rule-based coarticulation ideas. Revéret et al. [Reveret et al.] present a talking head which can track facial movements.
Lundeberg and Beskow [Lundeberg and Beskow] create a dialogue system, called August, which can convey a rich repertoire of extra-linguistic gestures and expressions. [Pelachaud, 1991; Cassell et al., 1994] develop a complex facial animation system that incorporates emotion, pitch and intonation. They also use a rule-based system to model coarticulation. Ruth [DeCarlo and Stone] is one of the few publicly available conversational agents. An interesting approach towards combining gestures and speech is presented in [Stone et al., 2004].
3. Visual Speech Synthesis
The main problem in visual speech synthesis is modeling coarticulation. To aid in the understanding of our current set of coarticulation rules, the following is a brief background of the field of coarticulation. It is commonly thought that actual human lip movements during the pronunciation of a word are sensitive to the context of the syllables that occur near the phoneme that is currently being enunciated. This is termed coarticulation. A simple solution to coarticulation attempts to subdivide a given word into its constituent phonemes. However, modern science suggests that the human brain uses a more sophisticated methodology to convert a plan to speak into the physical mechanisms that generate speech. Linguists have theorized that the speech articulators in real humans are moved in an energy-efficient manner. Therefore, the components of the mouth will move only when it is necessary to take a different shape in order to correctly pronounce a phoneme. Proper coarticulation requires that the mouth shape used to produce a particular phoneme depends not only on the current phoneme, but also on some set of phonemes before or after the current phoneme.
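To make the idea concrete, the following is a minimal sketch of a rule-based coarticulation filter. The rule shown here (each viseme absorbs a fraction of the following viseme, so the mouth prepares for the upcoming shape) is only an illustration; the function name and the influence factor are our own assumptions, not the system's actual rule set.

```python
def coarticulate(visemes, influence=0.4):
    """visemes: list of dicts mapping blendshape name -> weight.
    Blends each viseme with the next one so the mouth anticipates the
    upcoming shape (backward influence of a later phoneme)."""
    result = []
    for i, v in enumerate(visemes):
        if i + 1 < len(visemes):
            nxt = visemes[i + 1]
            shapes = set(v) | set(nxt)
            result.append({s: v.get(s, 0.0) * (1 - influence)
                              + nxt.get(s, 0.0) * influence
                           for s in shapes})
        else:
            result.append(dict(v))  # last viseme: nothing to anticipate
    return result
```

For the word "soon", for example, the viseme for "s" would be blended with a fraction of the rounded "oo" viseme that follows it.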
3.1 Emotion Control
A total of six emotions are incorporated in our system: happy, sad, angry, surprised, disgusted, and fearful. This set is determined by the available blendshapes generated from FaceGen. We allow the user to tag input text at the granularity of words. Words between tags are interpreted to have the same emotion as the previous tagged word, and any untagged words at the beginning of the text are treated as neutral words. Currently, our system uses a linear transition between different emotions and between different weights of the same emotion. The tags can optionally contain a weight factor that ranges from 0 to 1 and defines the amount of emotion that the tag carries. For example, the following text: <happy_0.5>Hi Susie<neutral> makes the avatar speak the given text starting with a 50% happy expression and finishing with a neutral one. When no weight is specified, a weight of 1 is used.
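A minimal sketch of a parser for this tagging scheme follows. The tag format (<emotion_weight>) is taken from the text above; the function name and regular expression are our own illustrative assumptions.

```python
import re

# Tags look like <happy_0.5> or <neutral>; untagged leading words default
# to neutral, and an omitted weight defaults to 1.0.
TAG = re.compile(r"<(\w+?)(?:_([\d.]+))?>")

def parse_tagged_text(text):
    """Return a list of (word, emotion, weight) triples."""
    emotion, weight = "neutral", 1.0
    words = []
    pos = 0
    for m in TAG.finditer(text):
        for w in text[pos:m.start()].split():
            words.append((w, emotion, weight))
        emotion = m.group(1)
        weight = float(m.group(2)) if m.group(2) else 1.0
        pos = m.end()
    for w in text[pos:].split():
        words.append((w, emotion, weight))
    return words
```

The per-word (emotion, weight) pairs produced here are what the linear transition between emotions would interpolate over time.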
4. System Overview
Apart from the proposed rule-based coarticulation model, our system is based on off-the-shelf, publicly available components. Figure 1 illustrates the modular high-level overview of the proposed system. The TTS (text-to-speech) component converts the input text into an audio signal and a phonetic sequence. A mapping from phonemes to visemes provides the visemes to the aforementioned coarticulation rules, which in turn supply the modified sequence of visemes to the audiovisual composition mechanism. The rest of this section describes each component in detail.
4.1 Face Modeling
To model 3D facial motion we use a blendshape approach [Joshi et al., 2003; Singular Inversion]. Given a set of n facial expressions and corresponding polygonal meshes B = {B0, B1, ..., Bn}, called blendshapes, we can create new facial expressions by blending different amounts of the original meshes:

F = B0 + Σi wi (Bi − B0)    (1)
where wi are arbitrary weights and B0 corresponds to a neutral expression. To avoid exaggerated expressions, the weights are typically restricted such that wi ∈ [0, 1]. By varying the weights wi over time, we can produce continuous animation. The main challenge of a blendshape approach is constructing an appropriate set of blendshapes. There are two ways to construct blendshapes. First, we can digitally scan the face of a human actor while he/she makes different expressions. This approach requires an expensive scanner and typically produces meshes that have different numbers of vertices. To use the simple blending formula in Equation 1, we must establish a correspondence between vertices across blendshapes, which is not practical.
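A sketch of the delta blendshape formula F = B0 + Σ wi (Bi − B0) on toy geometry, in pure Python to stay dependency-free. A mesh is represented as a list of (x, y, z) vertex tuples; the helper name is our own.

```python
def blend(B0, shapes, weights):
    """B0: neutral mesh. shapes: list of meshes Bi with the same vertex
    count as B0. weights: matching wi, clamped to [0, 1]."""
    ws = [min(max(w, 0.0), 1.0) for w in weights]   # avoid exaggerated expressions
    out = []
    for vid, v0 in enumerate(B0):
        # F = B0 + sum_i wi * (Bi - B0), applied per coordinate
        out.append(tuple(
            v0[c] + sum(w * (B[vid][c] - v0[c]) for w, B in zip(ws, shapes))
            for c in range(3)))
    return out
```

Note that the per-vertex loop is exactly why all blendshapes must share a vertex correspondence, as discussed above.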
4.2 Visual Speech Modeling
To animate the motion of the face that corresponds to speech, we use a viseme approach. A viseme is the mouth shape that corresponds to a phoneme. Typically, more than one phoneme corresponds to a single shape (viseme). Most approaches consider approximately 40 phonemes (including the silence phoneme), and the ratio of visemes to phonemes is about 1:3. In order to keep the system independent of the speech API employed, we make use of a dynamic mapping from phonemes to visemes. Visemes are constructed by blending together one or more blendshapes to mimic the closest possible real-life viseme. The first appendix shows the mapping from the set of phonemes to linear combinations of blendshapes (visemes). Although most phonemes correspond to a single blendshape, some require a linear combination of two or more blendshapes. In some cases, phonemes require a motion transition between two or more blendshapes. Linguists refer to this latter situation as diphthongs. For example, when pronouncing "bite", the mouth moves while pronouncing the "ai" phoneme corresponding to the letter "i". Once the sequence of phonemes and the sequence of blendshape weights are computed, a coarticulation rule filter is applied.
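A hypothetical fragment of such a phoneme-to-viseme mapping is sketched below (the real table is in the appendix; all blendshape names here are illustrative). Each phoneme maps to a weighted combination of blendshapes, and several phonemes can share one viseme.

```python
PHONEME_TO_VISEME = {
    "b":  [("lips_closed", 1.0)],
    "m":  [("lips_closed", 1.0)],   # several phonemes share one viseme
    "p":  [("lips_closed", 1.0)],
    "oo": [("lips_round", 0.9)],
    # Diphthongs like "ai" need a combination of shapes, animated as a
    # transition rather than a single static pose.
    "ai": [("jaw_open", 1.0), ("lips_wide", 0.7)],
}

def viseme_for(phoneme):
    # Unknown phonemes fall back to a neutral (empty) combination.
    return PHONEME_TO_VISEME.get(phoneme, [])
```

Keeping this table as data, rather than code, is what makes the mapping "dynamic" and independent of the particular speech API's phoneme set.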
4.3 Speech API
The text-to-speech module is based on the Microsoft Speech SDK Version 5 (MS SDK) [Microsoft, Inc.], a publicly available toolkit. Given an input string, the MS SDK produces both the corresponding audio and a sequence of phonemes along with their durations.
The MS SDK is simple to use and actively developed. The voice used to speak while animating the face can be selected from the available voices in the MS SDK. However, we have found that the set of voices available from AT&T [ATT Inc.] provides greater realism. In addition, the Speech SDK supports natural language processing (NLP). It is fairly easy to process audio input from a microphone and extract the associated text information.
4.4 Blending Expression with Speech
The original expression shapes from FaceGen are used as the default shapes for their corresponding emotions. However, when adding expression shapes to existing visemes, unwanted results may appear due to different shapes competing to change the same region of the face. For example, the anger blendshape provided by FaceGen has an open mouth with clenched teeth. This shape will conflict with all visemes that require the mouth to be closed, such as when mouthing the phonemes "b", "m", and "p". We resolve this conflict with another set of rules called expression constraints. An expression constraint specifies which blendshapes to use and what the weights should be for a given pair of phoneme and expression. In the absence of an expression constraint rule, the default shapes and weights are used. This method gives the user the flexibility to show the same emotion while applying different expression shapes alongside speech-related shapes. This implementation allows us to use half-shapes, which we created from the original expression shapes, to obtain an unchanged mouth region ready for viseme integration.
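The constraint lookup described above can be sketched as a simple table keyed by (phoneme, emotion) pairs. The shape names below (a full anger shape versus an upper-face half-shape) are our own illustrations of the half-shape idea, not the system's actual rule data.

```python
# Default expression shapes used when no constraint applies.
DEFAULT_EXPRESSION = {"angry": [("anger_full", 1.0)]}

# Bilabials need a closed mouth, so anger switches to an upper-face
# half-shape that leaves the mouth region free for the viseme.
EXPRESSION_CONSTRAINTS = {
    ("b", "angry"): [("anger_upper_half", 1.0)],
    ("m", "angry"): [("anger_upper_half", 1.0)],
    ("p", "angry"): [("anger_upper_half", 1.0)],
}

def expression_shapes(phoneme, emotion):
    """Return the expression blendshapes to mix with the current viseme."""
    return EXPRESSION_CONSTRAINTS.get((phoneme, emotion),
                                      DEFAULT_EXPRESSION.get(emotion, []))
```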
4.5 Composition of Audio and Video
Once the input text string has been parsed by the MS SDK, we generate an appropriate viseme sequence (a set of weighted blendshapes) that is enhanced by the coarticulation rule filter. The MS SDK offers important timing information that indicates the duration of each phoneme. Our system produces a viseme sequence that obeys the constraints defined by the timing of the audio sequence. Once the final timing of the viseme sequence is computed, we offer the ability to replay the audio sequence and the synchronized viseme sequence simultaneously.
4.6 Animation System
The proposed system is developed on top of the Dynamic Animation and Control Environment (DANCE), which is publicly available [Shapiro et al., 2005]. DANCE offers the standard functionality that one needs in a research tool, such as window management, etc. Its plug-in API allows complex research projects to link with DANCE and make use of other people's work. We plan to link our speaking head model to the human model and corresponding simulator provided by DANCE. However, the implementation of our system is not dependent on any DANCE-specific features. It can become part of any animation system.
5. Results
We assembled an expressive facial animation system using publicly available components, and here we demonstrate its effectiveness with two applications: (a) a text-to-speech synthesizer with expression control and (b) a conversational agent that can react to simple phrases. The expressive text-to-speech synthesizer takes tagged text input and produces the corresponding facial animation. The virtual actor speaks the input sentence with the effects of coarticulation while showing the indicated emotion(s). Examining the results of our coarticulation processor, Figure 3 illustrates the efficacy of the set of rules that we provide with the proposed system. The virtual actor is speaking the word "soon". In this case, the vowel that sounds like "oo" has a backward influence over the "s" phoneme. The coarticulation rules that we have provided cause the viseme associated with the "s" phoneme to blend with the viseme that is associated with the "oo" phoneme. In Figure 3, note the difference between the animation sequence before and after undergoing the coarticulation processor by comparing the first few frames of the top and bottom rows of frames. In the bottom row, the mouth properly prepares to form the "oo" viseme while it is in the early stages of the viseme that pronounces the "s" at the beginning of the word. The expression tags in a user input turn into smiles and frowns on the virtual actor's face while he or she is speaking. Figure 4 shows a few snapshots of a face model speaking the phrase "Hi Susie" with emotion control. The input to the system is the text and the corresponding expression tags: "<Surprised>Hi <Happy>Susie". As expected, the character starts with a surprised "Hi" and continues with a happy expression. Notice how the smile that corresponds to the happy expression is properly mixed with the speech. The set of expressions, along with the method of transition and blending, can be expanded or simplified, depending on the user's budget and intent.
The modularity of our approach allows for customized components.
6. The Multi-Layered Approach
6.1 The Layers
Although all movements may be rendered by muscles, the direct use of a muscle-based model is very difficult. The complexity of the model and our poor knowledge of anatomy make the results somewhat unpredictable. This suggests that more abstract entities should be defined in order to create a system that can be easily manipulated. A multi-layered approach is convenient for this. The system proposed in this paper is independent of the animation system. The results are specified in terms of perceptible movements (e.g. elevate the eyebrows with an intensity of 70%). In order to manipulate abstract entities like our representation of the human face (phonemes, words, expressions, emotions), we propose to decompose the problem into several layers. The high level layers are the most abstract and specify "what to do"; the low level layers describe "how to do it". Each level is seen as an independent layer with its own input and output. This approach has the following advantages:
- the system is extensible.
- the independence of each layer allows the behavior of an element of the system to be modified without impact on the others.
Five layers are defined in our approach:
layer 0: definition of the entity muscle or equivalent.
layer 1: definition of the entity minimal perceptible action.
layer 2: definition of the entities phonemes and expressions.
layer 3: definition of the entities words and emotions.
layer 4: synchronization mechanism between emotions, speech and eye motion.
6.2 Layer 0: Abstract Muscles
This level corresponds to the basic animation system. In our case, the software implementation is currently based on the Abstract Muscle Action procedures introduced in previous work (Thalmann, 1988). These actions are very specific to the various muscles and give the illusion of the presence of a bony structure. A problem with such an approach is that deformations are based on empirical models and not on physical laws. An interactive and more general model is currently under development. The model consists of a generic representation of the facial components, namely skin, bones, and muscles. The skin is represented as a polygonal mesh. The skull is considered rigid and immobile, except for the mandible. The muscles are the links between the skin points and the bone. These muscles act as directional vectors for determining the deformations of the skin, and their direction can be changed interactively by the designer.
6.3 Layer 1: Minimal Perceptible Action
A minimal perceptible action is a basic facial motion parameter. The range of this motion is normalized between 0% and 100% (e.g. raise the right eyebrow 70%). An instance of the minimal action is of the general form <frame number> <minimal action> <intensity>. The animation is carried out by traditional keyframing.
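The keyframed evaluation of a minimal perceptible action can be sketched as follows. The instance format (frame, action, intensity) comes from the text above; linear interpolation between keyframes and the function name are our own assumptions about the "traditional keyframing" step.

```python
def intensity_at(keys, frame):
    """keys: sorted list of (frame, intensity) keyframes for one minimal
    action, e.g. raise_r_eyebrow. Returns the intensity (0-100) at a frame,
    linearly interpolated between surrounding keyframes."""
    if frame <= keys[0][0]:
        return keys[0][1]
    for (f0, i0), (f1, i1) in zip(keys, keys[1:]):
        if f0 <= frame <= f1:
            t = (frame - f0) / (f1 - f0)
            return i0 + t * (i1 - i0)
    return keys[-1][1]  # hold the last keyframe's value
```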
6.4 Layer 2: Facial snapshot
A facial snapshot is obtained by specifying the value of each minimal action. Once defined, a snapshot has the following form: <frame number> <snapshot> <intensity>. It should be noted that several snapshots may be active at the same time; this allows, for example, a phoneme and a smile to be specified simultaneously.
6.4.1 Layer 2a: Phonemes
A phoneme snapshot is a particular position of the mouth during a sound emission. It is possible to represent a phoneme instance by a set of minimal actions interacting with the mouth area.
[ snapshot pp =>
[ action raise_sup_lip 30%]
[ action lower_inf_lip 20%]
[ action open_jaw 15%] ]
Normally, each word of a language has its phonetic representation according to the International Phonetic Alphabet. A representative subset of this is encoded in the form of snapshots.
6.4.2 Layer 2b: Expressions
An expression snapshot is a particular position of the face at a given time. This is generated by a set of minimal actions in the same way as phonemes. Based on Ekman's work on facial expressions (Ekman), several primary expressions may be classified: surprise, fear, disgust, anger, happiness, sadness (Fig. 3-6). Basic expressions and variants may be easily defined using snapshots.
6.5 Layer 3: Sequences of snapshots
6.5.1 Layer 3a: Words
As already mentioned, a word may be specified by the sequence of its component phonemes. However, there is no algorithm for the automatic decomposition of a word into phonemes (Allen 1987). The solution to this problem is to use a dictionary that may be created using a learning approach: each time an unknown word is detected, the user enters the decomposition, which is then stored in the dictionary. Another problem is the adjustment of the duration of each phoneme relative to the average duration of the phoneme and its context in the sentence (previous and next phonemes). Several heuristic methods have been proposed to solve this problem by researchers in the area of speech synthesis from text (Allen 1987).
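The learning dictionary described above can be sketched as follows. The class name, seed entry, and phoneme symbols are illustrative assumptions; the essential behavior is that an unknown word prompts the user once and the answer is stored for reuse.

```python
class PhonemeDictionary:
    """Word-to-phoneme dictionary that grows via user-supplied entries."""

    def __init__(self):
        self.entries = {"hello": ["hh", "eh", "l", "ow"]}  # seed entry

    def decompose(self, word, ask_user=None):
        word = word.lower()
        if word not in self.entries:
            if ask_user is None:
                raise KeyError(f"no decomposition stored for {word!r}")
            # Learning step: store the decomposition entered by the user.
            self.entries[word] = ask_user(word)
        return self.entries[word]
```

In an interactive system, ask_user would open a prompt; here it is a callback so the behavior is testable.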
6.5.2 Layer 3b: Emotions
An emotion is defined as the evolution of the human face over time: it is a sequence of expressions with various durations and intensities. The emotion model proposed here is based on the general form of an envelope: signal intensity = f(t) (Ekman 1978b).
An envelope may be defined using 4 stages:
- ATTACK: transition between the absence of signal and the maximum signal.
- DECAY: transition between the maximum signal and the stabilized signal.
- SUSTAIN: duration of the active signal.
- RELEASE: transition to the normal state.
One of the major problems is how to parameterize the emotions. To solve this, we introduce the concept of a generic emotion. If you expand the overall duration of the emotion envelope, the ATTACK and RELEASE stages will expand proportionally less than the SUSTAIN stage. To take this proportional expansion into account, we introduce a sensitivity factor associated with each stage. In order to render each emotion naturally, mechanisms based on statistical distributions have been introduced. For example, we may define a stage duration of 5 ± 1 seconds according to a uniform distribution, or an intensity of 0.7 ± 0.05 according to a Gaussian distribution. These parameterization mechanisms allow the creation of generic emotions. Once a generic emotion is introduced in the emotion dictionary, it is easy to produce an instance by specifying its duration and its magnitude.
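A minimal sketch of instantiating a generic emotion from the four-stage envelope follows. The base durations, sensitivity values, and the uniform jitter used here are our own illustrative assumptions, not the system's actual parameters.

```python
import random

STAGES = [  # (name, base duration in seconds, sensitivity to scaling)
    ("attack", 0.5, 0.2),
    ("decay", 0.3, 0.2),
    ("sustain", 3.0, 1.0),
    ("release", 0.7, 0.2),
]

def instantiate(scale=1.0, intensity=0.7, jitter=0.05, rng=random):
    """Return (name, duration, intensity) per stage for one emotion instance."""
    out = []
    for name, base, sens in STAGES:
        # Sensitivity-weighted expansion: when the envelope is stretched,
        # low-sensitivity stages (ATTACK, RELEASE) grow less than SUSTAIN.
        duration = base * (1.0 + sens * (scale - 1.0))
        # Statistical variation so repeated instances do not look identical.
        level = intensity + rng.uniform(-jitter, jitter)
        out.append((name, duration, level))
    return out
```

With scale=2.0, SUSTAIN doubles while ATTACK grows by only 20%, matching the proportional-expansion behavior described above.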
6.6 Layer 4: Synchronization mechanism
We already mentioned the need for synchronizing the various facial actions: emotions, word flow in a sentence and eye motion. In this layer we introduce mechanisms for specifying the starting time, the ending time and the duration of an action. This implies that each action can be executed independently of the current state of the environment, because the synchronization depends on time alone.
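The time-based scheduling idea can be sketched in a few lines: each action carries absolute start and end times, and at any instant the active set is determined by time alone, not by the state of other actions. The function and action names are illustrative.

```python
def active_actions(schedule, t):
    """schedule: list of (name, start, end) in seconds.
    Return the names of all actions active at time t."""
    return [name for name, start, end in schedule if start <= t < end]
```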
7. iFACE : A 3D Synthetic Talking Face
The iFACE system is a visual speech synthesizer that provides a form of virtual face-to-face communication. The system provides an interactive tool for the user to customize a graphical head model for the virtual agent of a person based on his/her range data. The texture is mapped onto the customized model to achieve a realistic appearance. Face animations are produced by using a text stream or speech stream to drive the model. A set of basic facial shapes and head actions is manually built and used to synthesize expressive visual speech based on rules.
The 3D geometry of a face is modeled by a triangular mesh. A few control points are defined on the face mesh. By dragging the control points, the user can construct different facial shapes. Two kinds of media, text streams and speech streams, can be used to drive the face animation. The phoneme information is extracted from text streams and speech streams and is used to determine viseme transitions. The timing information is extracted from either the synthetic speech or the natural speech to decide the rate at which the face animation process should occur. In this way, the system is able to synchronize the generated visual stream and audio stream.
Fig. 5: Control model
A control model is defined on the face component. The control model consists of 101 vertices and 164 triangles. It covers the facial region and divides it into local patches. For each triangle we define a local affine deformation that is applied to a patch of the face region. By changing the 3D positions of the feature points, the user can manually deform the shape of the face component. A set of basic facial shapes is built by adjusting the control points of the face model. These basic facial shapes are similar in spirit to the Action Units of Ekman. They are built so that all kinds of facial expressions can be approximated by linear combinations of them. iFACE uses a number of such basic shapes. Some head actions, such as nodding, shaking, etc., are also predefined by specifying the values of six action parameters and their temporal patterns.
Fig. 6: a) Feature points; b) Control model
7.1 Text-Driven Face Animation
First, the text is fed into the TTS (text-to-speech) engine. The TTS engine parses the text and generates a phoneme sequence, timing information and a synthesized speech stream. The phoneme sequence is then converted into a viseme sequence based on a lookup table. Visemes serve as key frames, located at the starting frame of each phoneme's utterance, at intervals indicated by the phoneme durations. The face shapes between key frames are decided by an interpolation scheme.
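The key-frame placement step described above can be sketched as follows. The toy lookup table and the millisecond timing format are our own assumptions; the idea is simply that each viseme key frame lands at its phoneme's start time as reported by the TTS engine.

```python
def place_keyframes(phoneme_timings, viseme_table):
    """phoneme_timings: list of (phoneme, start_ms, duration_ms) from TTS.
    Returns (time_ms, viseme) key frames, one per phoneme; the face shapes
    between these key frames come from an interpolation scheme."""
    return [(start, viseme_table.get(phoneme, "neutral"))
            for phoneme, start, _dur in phoneme_timings]
```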
Fig. 7. The architecture of text driven talking face
7.2 Speech Driven Face Animation
Fig. 8. The architecture of offline speech driven talking face
The system can be used to construct a realistic 3D model and synthesize natural facial animation from text, voice and emotion states. The system is useful for applications such as human-computer intelligent interfaces, collaborative applications, computer language education, and automatic animation production.
8. Conclusion
We have presented an affordable, off-the-shelf system for text-to-speech and 3D facial animation. The power of the system comes from its simplicity, modularity, and the availability of its components. Our system is easy to implement and extend, and furthermore, it offers a quick solution for applications that require affordable virtual actors that can speak from text.