Relatively Low Cost Real Time Facial Animation Computer Science Essay

Facial expressions play a major role in human-to-human communication. The capabilities of computer networks and the Internet have improved significantly, and they can now be used to convey facial expressions as well, for instance in a teleconference. This can be done by real-time facial animation of the participants at both ends of the conference. The MPEG-4 standard recognises this possibility, and such approaches can overcome some of the shortcomings of traditional video conferencing. Several prototype systems have been developed in which animation parameters for virtual characters are sent across the network. This essay examines one such prototype, developed at the University of Cape Town in South Africa. The system uses a performer-driven approach in which facial feature tracking is aided by markers. The researchers demonstrated it using cheap video cameras and ordinary personal computers, which suggests that the approach is a feasible, low-cost way of achieving real-time facial animation.

I. Introduction

Facial expressions are an important part of human-to-human communication. They allow us to express our emotions and provide a context for our speech by indicating both the nature and the extent of our attention and meaning. With the great potential for communication and interaction offered by computer networks and the Internet, it is important to make provision for traditional face-to-face interaction. This can be done by means of facial animation driven directly by the expressions of the communicating participants. Giving participants facial expressions should improve the realism of the interaction, encourage greater emotional investment and increase the sense of virtual 'presence'. Traditional video conferencing has definite drawbacks: it demands heavy bandwidth and offers limited collaborative capabilities. The advantages of using virtual characters for communication are:

1. They are cheaper on bandwidth, since only expressions and not entire images need to be transmitted. This means wider platform applicability, including mobile devices.

2. Virtual characters offer the participants anonymity. Of course, the user could choose to have their virtual character closely resemble them.

3. If the virtual characters are placed in a virtual environment, there is potential for much richer interaction.

The researchers present a prototype system that provides real-time facial animation for virtual characters across computer networks. The system uses a performer-driven technique, with markers aiding facial feature tracking. With this system they demonstrate the feasibility of a low-cost approach using webcams and personal computers.

II. The phenomenon

A large part of facial animation research has been devoted to the design of computer facial models. The first working facial model was developed by Parke [10]. Since then, models that reflect muscle movements have been developed by Platt [11] and Waters [13]. The usual trend has been towards creating realistic 3-D facial models. In contrast to this line of study, some more recent work has looked at non-photorealistic face models [7, 2]. There is evidence suggesting that these models may be more effective for supporting virtual interactions [7].

To drive the animation of these models, parameterisation schemes have been used to quantify facial expressions. Magnenat-Thalmann et al. [9] developed an abstract muscle action (AMA) model for describing facial expressions, building on the seminal work of Ekman and Friesen [5]. More recently, MPEG has included facial definition and animation parameters as part of the MPEG-4 SNHC standard. The aim is to support very low bandwidth communication and interaction using model-based coding and compression.

To drive the expressions of an animation, actor tracking is often used. Other techniques may be used, but these may sacrifice continuity and smooth, realistic movement. Moreover, if real-time interaction is important, the user's experience may be diminished if expressions have to be controlled through an explicit interface.

When actor tracking (animation driven by a human performer) is used, computer vision techniques are employed to track different facial features. An analysis of the tracked feature positions quantifies the expressions in terms of some parameterisation scheme (e.g. AMA or MPEG-4). The measured values are then used to reproduce the original expressions on a computer facial model.

A recent example of such a system is that of Jacobs et al. [2]. They have created a real-time system for recognising expressions and animating hand-drawn characters. The recognition system tracks facial features such as the eyes and mouth directly, and interprets the movements in terms of the MPEG-4 Face Animation Parameters (FAPs). Goto et al. [6] have designed a real-time facial analysis system to drive the facial animation of realistic 3-D models, also using the MPEG-4 FAPs. Their models are intended for use in virtual environments.

Figure 1: Placement of facial markers and corresponding search regions. The 14 markers are labelled m0 to m13. The reference markers are m0, m1, m8 and m9. The dots overlaid on the right image show the sampled pixels.

III. Overview of the Method

The researchers at the University of Cape Town are developing a facial animation system that uses performer-based animation. Their objective is to use comparatively low-cost, widely available equipment, with a focus on recognising major expressions and lip movement from speech.

The system runs on Windows machines and uses a single low-cost digital video camera for facial feature tracking. The choice of the Windows platform was motivated by its widespread availability. The Windows DirectX SDK also has good support for multimedia programming, including routines for audio and video capture and processing. In addition, moderate-quality, low-cost cameras ('webcams') are readily available for Windows systems. The Windows driver layer provides a consistent interface for accessing the video stream from these cameras, so code written for one camera should work with any other.

The researchers have developed the animation system as a series of components, which are described in the rest of this essay. Section IV covers the techniques used for tracking a performer's facial expressions: capturing input from a camera, recognising and tracking facial markers, correcting for rigid head motion and mapping expression displacements into normalised parameters. Section V describes the two methods they have used to animate faces in 2-D. The first uses a very simple cartoon-like face, whereas the second morphs an image texture to give more realistic animation. Section VI describes the prototype system used to send a performer's facial expressions to a remote animation client in real time, together with preliminary results from testing. The essay ends with a conclusion and a discussion of future work.

IV. Methodology of Tracking

(a) Input

As mentioned above, the Windows DirectX SDK, and more specifically DirectShow, supplies routines for capturing input from both audio and video streams. The researchers used these routines to capture video data, frame by frame, from a camera. Using a Logitech webcam at a resolution of 320x240 with 24-bit colour, about 30 frames are captured per second. This has proven sufficient for their tracking system.

(b) Recording Facial Expressions

To recognise expressions, markers (small, coloured beads) are attached to the performer's face and tracked over time. The markers simplify the recognition problem and allow for faster tracking. They also allow more robust tracking across a diverse range of subjects: some may have dark skin or obscured features (a problem for many systems that track features directly). Markers also help to track features under poor or changing lighting conditions. Figure 1 shows the configuration of the 14 markers used by the prototype system.

Standard image segmentation techniques are used to extract and identify the markers in the video images. The segmentation process first converts pixels from RGB to HSV colour space, and then thresholds them using upper and lower bounds on all three colour components. Region growing is applied from each sampled pixel that falls within these bounds. The resulting pixel clusters are tested against a number of criteria, explained below. Since a single 2-D position is needed for each marker, the centroid of each cluster is calculated. To ensure real-time marker recognition, the researchers take advantage of the coherence in marker positions between successive frames. Subsampling is also used to speed up marker identification. These techniques are explained in more detail in the following subsections.
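
The segmentation steps described above (HSV thresholding, region growing and centroid calculation) can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the function name, the image representation (rows of RGB tuples) and the 4-connected growing rule are assumptions made for illustration, and hue wrap-around for red markers is not handled.

```python
import colorsys
from collections import deque

def find_marker_centroids(image, lower, upper):
    """Return (cx, cy, pixel_count) for each connected cluster of in-range pixels.

    image: list of rows, each a list of (r, g, b) tuples with 0-255 channels.
    lower, upper: (h, s, v) bounds in the 0.0-1.0 range used by colorsys.
    """
    height, width = len(image), len(image[0])

    def in_range(x, y):
        r, g, b = image[y][x]
        hsv = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        return all(lo <= c <= hi for c, lo, hi in zip(hsv, lower, upper))

    visited = [[False] * width for _ in range(height)]
    clusters = []
    for y in range(height):
        for x in range(width):
            if visited[y][x] or not in_range(x, y):
                continue
            # Grow a 4-connected region outwards from this seed pixel.
            queue = deque([(x, y)])
            visited[y][x] = True
            sum_x = sum_y = count = 0
            while queue:
                px, py = queue.popleft()
                sum_x += px
                sum_y += py
                count += 1
                for nx, ny in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not visited[ny][nx] and in_range(nx, ny)):
                        visited[ny][nx] = True
                        queue.append((nx, ny))
            # The centroid gives the single 2-D position used for the marker.
            clusters.append((sum_x / count, sum_y / count, count))
    return clusters
```

This sketch scans every pixel for seeds; in the system described, seeds come only from the subsampled pixels inside each search region, which is where most of the speed-up comes from.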

(c) Techniques in Calibration

Before tracking begins, the performer is asked to hold a neutral expression and to face the camera directly. The system then scans the entire video image, searching for all the markers on the face. Once they have been found, these initial marker positions are stored for later use. The initial marker configuration allows the system to calibrate for head alignment and expression. This is a one-time process that is not repeated during the session.

As soon as calibration is complete, tracking can start. The recognition system tracks individual facial features independently: search regions are created and maintained for each eyebrow, for the mouth and for the four separate reference markers (explained later in this essay). Figure 1 shows the reference markers and the search regions. Tracking the features independently allows the system to detect missing markers more easily. It also simplifies the task of labelling the markers (the markers are tracked and their correspondences maintained from frame to frame).

The search regions are adjusted at regular intervals so that each is larger than the bounding box of its feature's group of markers from the previous frame. The search regions are then constrained so that they do not overlap. Within each search region, the image is sampled.
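
As a rough illustration of the search-region update, the following Python sketch grows the bounding box of a feature's markers by a margin and clamps it to the image. The function names, the fixed pixel margin and the inclusive box convention are assumptions; the source does not say how overlaps between regions are resolved, so the sketch only detects them.

```python
def update_search_region(marker_boxes, margin, image_size):
    """Return a region (x0, y0, x1, y1) enclosing all marker boxes plus a margin.

    marker_boxes: list of inclusive (x0, y0, x1, y1) boxes from the previous frame.
    image_size: (width, height) of the video frame in pixels.
    """
    width, height = image_size
    x0 = max(min(b[0] for b in marker_boxes) - margin, 0)
    y0 = max(min(b[1] for b in marker_boxes) - margin, 0)
    x1 = min(max(b[2] for b in marker_boxes) + margin, width - 1)
    y1 = min(max(b[3] for b in marker_boxes) + margin, height - 1)
    return (x0, y0, x1, y1)

def regions_overlap(a, b):
    """True if two inclusive (x0, y0, x1, y1) regions share at least one pixel."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
```

A larger margin tolerates faster feature movement between frames at the cost of sampling more pixels.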

Figure 2: The sampling technique used for search regions. The search region border is shown by a dashed line. Each grid cell represents a pixel. The size of the blue dots indicates the order in which the pixels are visited - larger dots are visited first.

Figure 3: Horizontal sub-division of the two eyebrow search regions. The more visible markers were found during the first pass, when the sampling rate was lower (hence the fewer dots).

If needed, more than one pass is made over the search region, each time increasing the sampling rate. The sampling is carried out in such a way that no pixel is revisited on a later pass (Figure 2). The spacing between samples at the highest sampling rate limits the size of the markers that can be detected in the image. For the researchers' sampling pattern, a marker must have a diameter of at least 2 pixels in order to be found. The sampling process ends when the required number of good matches has been found. For instance, if three markers are expected in an eyebrow region, the search ends as soon as three good matches are found.
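
One way to obtain passes of increasing density that never revisit a pixel is to halve the sample spacing on each pass and skip positions that lie on the previous, coarser grid. The Python sketch below shows this idea; the spacings of 8, 4 and 2 pixels are assumptions chosen so that the finest pass matches the 2-pixel minimum marker diameter mentioned above, and the exact pattern used by the researchers may differ.

```python
def sampling_passes(x0, y0, x1, y1, steps=(8, 4, 2)):
    """Yield (pass_index, x, y) over the region [x0, x1) x [y0, y1).

    Each step must divide the previous one, so that every coarser grid is a
    subset of the next finer grid and already-visited pixels can be skipped.
    """
    for index, step in enumerate(steps):
        coarser = steps[index - 1] if index > 0 else None
        for y in range(y0, y1, step):
            for x in range(x0, x1, step):
                on_coarser_grid = (coarser is not None
                                   and (x - x0) % coarser == 0
                                   and (y - y0) % coarser == 0)
                if on_coarser_grid:
                    continue  # this pixel was sampled on an earlier pass
                yield index, x, y
```

A caller would iterate over the generator, test each sampled pixel against the colour bounds and stop iterating as soon as the expected number of good matches has been found, so the finer passes are only paid for when markers are small or partly hidden.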

For the eyebrow regions, the search can be improved further. Since the eyebrow markers are spread out horizontally, no two markers will have bounding boxes that overlap horizontally (that is, no marker lies directly above another). This means that once a good match for an eyebrow marker has been found, its horizontal extent can be excluded from further sampling passes for the current frame. Figure 3 illustrates this. Note that the bigger, more visible markers were found on an initial pass, when the sampling rate was lower.

Defining a good match is difficult and depends largely on the markers being used. False matches should be rejected, but a marker should still be identified even if it is only partially visible. The researchers assess each match by assigning a 'quality-of-match' weighting to every potential marker found in the image. The criteria for weighting a match are:

• Cluster Size: the number of pixels in the cluster.

• Cluster Shape (1): the ratio of the cluster's height to its width (should be approximately 1 for circular markers).

• Cluster Shape (2): the fraction of the cluster's bounding box that the cluster occupies (should be close to 1 for circular markers).

When deciding which clusters represent markers, both the 'quality-of-match' weighting and the proximity to previous marker positions are taken into account.
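
The three criteria can be combined into a single score in many ways. The Python sketch below is one hypothetical combination: each criterion is mapped to a value between 0 and 1 and the three are averaged with equal weight. The source does not give the actual formula or weights, so the function name, the `expected_size` parameter and the equal weighting are all assumptions.

```python
def match_quality(pixels, expected_size):
    """Score a pixel cluster between 0 and 1 (higher means more marker-like).

    pixels: list of (x, y) coordinates belonging to one cluster.
    expected_size: assumed typical pixel count of a fully visible marker.
    """
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    count = len(pixels)
    # Cluster size: ratio of actual to expected pixel count, folded into (0, 1].
    size_score = min(count, expected_size) / max(count, expected_size)
    # Cluster shape (1): height-to-width ratio, 1.0 for a square bounding box.
    aspect_score = min(width, height) / max(width, height)
    # Cluster shape (2): fraction of the bounding box covered by the cluster.
    fill_score = count / (width * height)
    return (size_score + aspect_score + fill_score) / 3.0
```

Because the score degrades gradually, a partly hidden marker still receives a moderate score instead of being rejected outright, which matches the stated goal of identifying partially visible markers.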

Before proceeding, it is important to discuss the purpose of the four reference markers. These markers are well separated, slightly bigger than the rest and generally visible in all useful orientations. In each frame the system searches for these markers first. If even one of them is lost, it is unlikely that the other, smaller markers will be found, so the current frame is discarded and the system searches again in the next frame. If more than four consecutive frames are lost, the system waits for a user-driven re-initialisation. In addition to providing an acid test for marker recognition, the reference markers also allow:

• Approximation of head motion (explained in a later section).

• Automatic prediction of the search regions in the current frame.

• Automatic prediction of the positions of markers that are lost.
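
The frame-discarding rule built on the reference markers can be sketched as a small state machine in Python. The class and method names are hypothetical; only the thresholds (four reference markers, more than four consecutive lost frames) come from the description above.

```python
class LossMonitor:
    """Decide per frame whether to track, skip the frame, or wait for the user."""

    MAX_CONSECUTIVE_LOST = 4   # more than this many lost frames forces a restart
    REFERENCE_COUNT = 4        # all four reference markers must be found

    def __init__(self):
        self.lost_frames = 0
        self.waiting_for_user = False

    def update(self, reference_found):
        """reference_found: number of reference markers located in this frame."""
        if self.waiting_for_user:
            return "restart"
        if reference_found == self.REFERENCE_COUNT:
            self.lost_frames = 0
            return "track"
        self.lost_frames += 1
        if self.lost_frames > self.MAX_CONSECUTIVE_LOST:
            self.waiting_for_user = True
            return "restart"
        return "skip"

    def reset(self):
        """Called after the user re-initialises the tracker."""
        self.lost_frames = 0
        self.waiting_for_user = False
```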

In the prototype system, the researchers chose to use 14 face markers for the following reasons: at least four reference markers are required to correct for rigid head motion; four markers are needed to describe the shape of the mouth; and each eyebrow needs three markers, because the outer eyebrow marker may be occluded when the head rotates.

Figure 4: Elimination of the effects of rigid motion.

Because of the way the markers are arranged, the system can always sort and label them. In any given frame, an individual marker's position is predicted from the corresponding marker in the calibration set, transformed to account for any overall motion of the performer's head.

(d) Rigid Head Motion Correction

It is important to separate the effects of global head motion from marker movement due to changes in expression. Only then can expressions be accurately measured and quantified. The four reference markers are used to correct for the distortion caused by head motion. They stay static on the face, irrespective of expression, and therefore move only because of rotations and translations of the head.

During the calibration phase, the system captures an initial shot of the face looking straight at the camera, i.e. with the image plane parallel to the face. The four reference markers in this shot define an initial reference quadrilateral. For every subsequent frame, the homography, T, that maps the current quadrilateral back to the initial reference quadrilateral is computed [14, 12]. All the markers are then transformed by T. By rectifying [8] the plane defined by the current quadrilateral, the effects of rigid head motion can be removed. Once the rigid motion has been corrected for, the marker displacements can be calculated. This process is illustrated in Figure 4.
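
A homography between two quadrilaterals has eight unknowns once its last entry is fixed to 1, and four point correspondences give exactly eight linear equations. The Python sketch below solves that system with Gauss-Jordan elimination and applies the result to a point. It is a textbook construction under that normalisation, not necessarily the formulation used by the researchers, and the function names are illustrative.

```python
def solve_linear(a, b):
    """Solve a x = b for a square system using Gauss-Jordan with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def homography(src, dst):
    """Return the 3x3 matrix T mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def apply_homography(t, point):
    """Transform a 2-D point by T, including the perspective divide."""
    x, y = point
    w = t[2][0] * x + t[2][1] * y + t[2][2]
    return ((t[0][0] * x + t[0][1] * y + t[0][2]) / w,
            (t[1][0] * x + t[1][1] * y + t[1][2]) / w)
```

In use, `src` would be the four reference markers in the current frame and `dst` the same markers from the calibration shot; every other marker is then passed through `apply_homography` before its displacement is measured.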

If the face were a perfect plane, rectification would cancel the rigid motion exactly. Unfortunately this is not the case, so the method is an approximation. In addition, since the positions of the reference markers are given by cluster centroids, and the shapes of the clusters representing the markers are subject to noise, their positions cannot be found exactly. This leads to a 'jitter' effect from frame to frame, which causes problems in the estimation of the homography and results in errors. To reduce this problem, a Gaussian smoothing filter is used to remove high-frequency noise from the marker positions.
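
The source does not specify the filter's window or variance, so the following Python sketch simply assumes a small Gaussian kernel applied along the time axis of one marker's track, with the ends of the track clamped. The function names and the default `radius` and `sigma` values are illustrative.

```python
import math

def gaussian_kernel(radius, sigma):
    """Return normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_track(positions, radius=2, sigma=1.0):
    """Smooth a list of per-frame (x, y) marker positions over time."""
    kernel = gaussian_kernel(radius, sigma)
    last = len(positions) - 1
    smoothed = []
    for i in range(len(positions)):
        sx = sy = 0.0
        for k, weight in enumerate(kernel):
            # Clamp indices so the first and last frames reuse edge samples.
            j = min(max(i + k - radius, 0), last)
            sx += weight * positions[j][0]
            sy += weight * positions[j][1]
        smoothed.append((sx, sy))
    return smoothed
```

A centred window like this one needs `radius` future frames, so a live system would either accept that delay or use a one-sided window; the trade-off is between latency and the amount of jitter removed.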

(e) How the expressions are normalised

After the corrective transform has been applied to the marker positions, we are in a position to determine marker displacements. These displacements are measured against the neutral expression recorded in the calibration step.

In fact, we would like to be able to apply the measured expression to a face of arbitrary shape. The configuration of a face - the properties that make each face unique, such as the width of the brows and the distance between the eyes - is called its conformation. The face's conformation needs to be accounted for when measuring (and synthesising) expressions.

The researchers use an approximation of the performer's conformation, given by the distances between markers in the neutral calibration image, to 'normalise' the measured expression displacements. They estimate the maximum possible displacements for the various features using face metrics measured from the neutral face (Tables 1 and 2) and use these values to scale the expression displacements.

After the scaling, the horizontal and vertical components of the marker displacements have values between 0 and 1. For transmission, these parameters are scaled and quantised to take integer values between 0 and 255. Each normalised marker displacement can then be represented by two bytes. This method bears some similarity to the MPEG-4 scheme for facial animation parameters (FAP).
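
The scaling and quantisation step can be written out in a few lines of Python. The sketch assumes that a displacement is divided by its maximum, clamped to the range 0 to 1 and rounded to the nearest integer in 0 to 255; the source does not say how the sign of a displacement is handled, so negative values are simply clamped to 0 here. The function names are illustrative.

```python
def normalise(displacement, max_displacement):
    """Scale a displacement by its maximum and clamp the result to [0, 1]."""
    return min(max(displacement / max_displacement, 0.0), 1.0)

def quantise(value):
    """Map a value in [0, 1] to an integer in 0..255."""
    return int(round(value * 255))

def encode_marker(dx, dy, max_dx, max_dy):
    """Return the two bytes (horizontal, vertical) for one marker."""
    return bytes([quantise(normalise(dx, max_dx)),
                  quantise(normalise(dy, max_dy))])
```

For example, a horizontal displacement of 2 pixels against a maximum of 10 normalises to 0.2 and is transmitted as the byte value 51.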

Table 1: Maximum displacement values for various feature points in terms of face metrics (Table 2). These values are used to normalise marker displacements. H = Horizontal, V = Vertical.

Table 2: Definitions of the face metrics used in Table 1. The marker positions are taken from a neutral face during initialisation. The second subscript indicates the component of the marker position used: x (horizontal) or y (vertical).

V. Facial Animation in 2-D

When applying the normalised parameters to a facial model, the above process is reversed using the model's conformation. This allows us to use the same parameters to drive the animation of a variety of faces.
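
Reversing the process on the receiving side amounts to rescaling each byte by the model's own maximum displacement and adding the result to the model's neutral position. The Python sketch below assumes the same unsigned 0 to 255 encoding as the transmission format described earlier; the function name and the `model_max` parameter are hypothetical.

```python
def apply_to_model(neutral_position, packed, model_max):
    """Return the animated (x, y) position of one feature point on a model.

    neutral_position: (x, y) of the point on the model's neutral face.
    packed: two bytes holding the quantised horizontal and vertical parameters.
    model_max: (max_dx, max_dy) for this point, from the model's conformation.
    """
    nx, ny = neutral_position
    dx = packed[0] / 255.0 * model_max[0]
    dy = packed[1] / 255.0 * model_max[1]
    return (nx + dx, ny + dy)
```

Because only `model_max` and the neutral position depend on the model, the same received bytes can drive faces with quite different proportions.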

(a) In Cartoon Face

The first animated model is a simple 2-D polygon model. Groups of polygons represent features such as a mouth, eyebrows and eyes. The polygons are layered and drawn from back to front to prevent occlusion.

Each part of the face is dependent, in both position and movement, on the animation parameters. For example, when the marker on the bottom lip moves down, the animation moves the cartoon face's lip, chin and jaw by an appropriate amount. The initial shape of the cartoon face - its conformation - is set at initialisation time.

Figure 5: A snapshot from their implementation of the Beier-Neely morphing algorithm. Note the morph-lines, conformation points and grid.

Some parts of the cartoon face have automated movement. For example, the eyes blink randomly at a rate of approximately once every seven seconds. Also, the mouth opens and shuts depending

Although the cartoon face is extremely simple, it effectively conveys the expressions of the performer. More complex 2-D cartoon models have been developed by [7]. This tracking system could be used to drive such models.

(b) Morphing the Image

With their second method they have produced a more realistic animation by warping an image of a face. The researchers use a uniform rectangular grid of connected nodes, onto which a facial texture is mapped. The user defines key conformation points on the textured grid corresponding to the points that are tracked on the performer. A warping technique is then used to distort the grid in response to the input parameters.

They have used the technique developed by Beier and Neely [1] to morph the features of a face texture to a target expression. This technique was used with great success in Michael Jackson's 'Black or White' video to morph between a variety of faces. The technique is based on the concept of morph-lines. A series of directed line segments is placed along corresponding features in the source and target images. The positions of the endpoints of the lines are then interpolated over time and the surrounding pixels are warped accordingly. For further details of the algorithm, please consult their publication. The original technique applies the warping to every pixel in the target image. In order to achieve real-time animation rates, the researchers have adapted the technique to morph the nodes of the grid rather than every pixel. The number of grid nodes can be decreased for faster performance, or increased to achieve better visual results.

The original technique also considers every morph-line for each pixel in the image. Instead, the researchers define a radius of influence for each morph-line. Only those grid points falling within this radius consider the contribution from that line. These adjustments allow for real-time morphing and animation.
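
The following Python sketch shows how a single grid node might be displaced using the line-based mapping of Beier and Neely [1], restricted to morph-lines within a radius of influence. The weighting constants `a`, `b` and `p` follow the form given in the original publication, but their default values, the function name and the way the radius test is applied are assumptions; morph-lines are assumed to have non-zero length.

```python
import math

def warp_node(node, line_pairs, radius, a=1.0, b=2.0, p=0.0):
    """Return the warped position of one grid node.

    line_pairs: list of ((P, Q), (P2, Q2)), each a morph-line in its neutral
    and target position, with points given as (x, y) tuples.
    """
    x, y = node
    total_dx = total_dy = total_weight = 0.0
    for (p0, q0), (p1, q1) in line_pairs:
        lx, ly = q0[0] - p0[0], q0[1] - p0[1]
        length = math.hypot(lx, ly)
        # u: position along the line (0 at P, 1 at Q); v: signed distance from it.
        u = ((x - p0[0]) * lx + (y - p0[1]) * ly) / (length * length)
        v = ((x - p0[0]) * -ly + (y - p0[1]) * lx) / length
        if u < 0.0:
            dist = math.hypot(x - p0[0], y - p0[1])
        elif u > 1.0:
            dist = math.hypot(x - q0[0], y - q0[1])
        else:
            dist = abs(v)
        if dist > radius:
            continue  # this morph-line is too far away to influence the node
        mx, my = q1[0] - p1[0], q1[1] - p1[1]
        target_length = math.hypot(mx, my)
        # Rebuild the point from (u, v) relative to the target line.
        wx = p1[0] + u * mx + v * -my / target_length
        wy = p1[1] + u * my + v * mx / target_length
        weight = (length ** p / (a + dist)) ** b
        total_dx += (wx - x) * weight
        total_dy += (wy - y) * weight
        total_weight += weight
    if total_weight == 0.0:
        return (x, y)
    return (x + total_dx / total_weight, y + total_dy / total_weight)
```

With an 80x80 grid this function runs a few thousand times per frame instead of once per texture pixel, and the radius test lets most nodes skip most lines.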

Figure 5 shows a snapshot from the implementation of the morphing algorithm. The grid consists of 80x80 cells; the texture is a 512x512 image. Since the technique uses only a single image, the morphing of the mouth results in stretching of the teeth, which looks very unnatural (see Figure 7). To solve this problem, it would be possible to use two or more images of the face and blend between them when required.

VI. About the Prototype System

The techniques described above were used to develop a real-time prototype system. The system consists of three main components:

I. The recognition system: each frame of the video input is analysed and the positions of the markers are determined. The techniques described above are used to identify and label the markers.

II. The communication system: the marker positions are placed into a packet and transmitted to the remote animation system using Windows asynchronous sockets. At the remote system each packet is unpacked and the values used in context.

III. The animation system: the animation system uses the received marker positions to drive the animations of Section V. The initial shape of the cartoon face - its conformation - is set at initialisation time with the transmission of a special calibration packet.
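
To give a sense of how small each update can be, the Python sketch below packs one frame of parameters into a fixed-size binary packet and unpacks it again. The layout (a 16-bit sequence number followed by two bytes per marker) is a hypothetical format chosen for illustration; the source does not describe the actual packet structure, and the socket code is omitted.

```python
import struct

NUM_MARKERS = 14
# Network byte order: one unsigned 16-bit sequence number, then 28 bytes.
PACKET_FORMAT = "!H" + "B" * (2 * NUM_MARKERS)

def pack_frame(sequence, params):
    """params: list of 14 (horizontal, vertical) integer pairs in 0..255."""
    flat = [component for pair in params for component in pair]
    return struct.pack(PACKET_FORMAT, sequence & 0xFFFF, *flat)

def unpack_frame(packet):
    """Return (sequence, params) from a packet produced by pack_frame."""
    fields = struct.unpack(PACKET_FORMAT, packet)
    sequence, flat = fields[0], fields[1:]
    return sequence, list(zip(flat[0::2], flat[1::2]))
```

Under this assumed layout each frame occupies 30 bytes, so even at 30 frames per second the stream is under 1 KB per second, far less than the video it replaces.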

Figure 6 shows the major components of the system and the flow of information between them. Occasionally the system mis-tracks due to rapid head movements or major disturbances to the room lighting. The tracking system has functionality to recover from these situations, however. If necessary, a re-initialisation can be performed by the user.

Figure 6: Major system components. The transfer of information indicated by the dashed line is performed once, during system initialisation.

The system was tested at a university open day demonstration. The computer tracking the facial expressions was a dual PII 350MHz machine with 256MB of RAM and Voodoo 2 and FireGL 1000 Pro graphics cards. A Logitech Web-Cam Go camera was used to capture video at a resolution of 320x240 and a frame rate of approximately 30 fps. The remote animation system was an AMD Athlon 500MHz machine with 392MB of RAM and an NVIDIA GeForce 256 graphics card. Both machines were running Windows 2000. The network connection was via a T1 LAN. The system ran consistently at approximately 13 frames per second.

Figure 7 (full page after references) shows a few images captured during a demonstration session. They illustrate the correlation between the expressions of an actress and two virtual characters.

VII. The Conclusion

The researchers are developing a performance-based system to provide real-time facial animation for virtual characters. Their system uses relatively low-cost equipment to perform facial feature tracking and analysis. They have tested the system with live video input and animated two different 2-D face models. Their contribution is that they have shown the feasibility of running such a system on low-cost hardware. However, in my opinion, further work and user testing are required to fully demonstrate the utility of their system.

VIII. Areas of Future work

Two major areas of further work are:

1. Models: With the demonstration system described above, they used 2-D animations of facial expressions. The researchers hope to extend the system to animate 3-D facial models, such as those developed by Waters [13].

2. Integration with a Virtual Environment: The entire animation system could be integrated with a virtual environment, such as DIVE [4,3], in order to enhance the collaborative aspects of the interaction. User experiments may then be performed to test the impact of facial animation on immersion and presence.

In addition, commercial systems are now appearing that track facial features reliably without requiring markers. If the system can be successfully extended as indicated above, the researchers intend to go on to reduce its dependence on markers.

IX. References

[1] Thaddeus Beier and Shawn Neely. Feature-based image metamorphosis. In Edwin E. Catmull, editor, Computer Graphics (SIGGRAPH '92 Proceedings), volume 26, pages 35-42, July 1992.

[2] Ian Buck, Adam Finkelstein, Charles Jacobs, Allison Klein, David H. Salesin, Joshua Seims, Richard Szeliski, and Kentaro Toyama. Performance-Driven Hand-Drawn Animation. In Non-Photorealistic Animation and Rendering Symposium, June 2000.

[3] C. Carlsson and O. Hagsand. DIVE - A Multi User Virtual Reality System. In IEEE Virtual Reality Annual International Symposium, pages 394-400, September 18-22 1993.

[4] C. Carlsson and O. Hagsand. DIVE - A Platform for Multi-User Virtual Environments. Computers and Graphics, 17(6), 1993.

[5] Paul Ekman and Wallace V. Friesen. Manual for the Facial Action Coding System. Consulting Psychologists Press, Inc., Palo Alto, California, 1978.

[6] Taro Goto, Marc Escher, Christian Zanardi, and Nadia Magnenat-Thalmann. MPEG-4 based animation with face feature tracking. In Proceedings of Eurographics'99, 1999.

[7] T. Hagen, P. Noot, and H. Ruttkay. Chartoon: a system to animate 2d cartoon faces, 1999.

[8] D. Liebowitz and A. Zisserman. Metric rectification for perspective images of planes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 482-488, June 1998.

[9] N. Magnenat-Thalmann, N.E. Primeau, and D. Thalmann. Abstract muscle actions procedures for human face animation. Visual Computer, 3(5):290-297, 1988.

[10] F.I. Parke. A Parametric Facial Model for Human Faces. PhD thesis, University of Utah, Salt Lake City, UT, December 1974. UTECCSc- 72-120.

[11] S.M. Platt. A system for computer simulation of the human face. Master's thesis, The Moore School, University of Pennsylvania, Philadelphia, 1980.

[12] J. Semple and G. Kneebone. Algebraic Projective Geometry. Oxford University Press, 1979.

[13] Keith Waters. A Muscle Model for Animating Three-Dimensional Facial Expression. In SIGGRAPH 87, volume 21 of Computer Graphics Annual Conference series, pages 17-24. Addison Wesley, July 1987.

[14] George Wolberg. Digital Image Warping. IEEE Computer Society Press, 1990.

Figure 7: Expression correlation between an actress and two virtual characters.