Lip Reading Using Neural Networks Computer Science Essay



Neural networks, with their ability to derive meaning from complicated or imprecise data, can extract patterns and detect trends that are too complex to be noticed by humans or by other computer techniques. A trained neural network can be thought of as an expert in the category of information it has been given to analyze. Here neural networks are applied to lip reading, which recognizes speech from visual information and is one of the more recent techniques for speech recognition. We describe a lip reading system that uses both shape information from the lip contours and intensity information from the mouth area. Shape information is obtained by tracking and parameterising the inner and outer lip boundaries in an image sequence. Intensity information is extracted from a grey-level model based on principal component analysis. In contrast to other approaches, the intensity area deforms with the shape model, so that the same object features are represented after non-rigid deformation of the lips. We describe speaker-independent recognition experiments based on these features. Preliminary results suggest that similar performance can be achieved using either shape or intensity information, and slightly higher performance by combining them.


A neural network is a powerful data modeling tool that is able to capture and represent complex input/output relationships. The motivation for the development of neural network technology stemmed from the desire to develop an artificial system that could perform "intelligent" tasks similar to those performed by the human brain. Neural networks resemble the human brain in the following two ways:

A neural network acquires knowledge through learning.

A neural network's knowledge is stored within inter-neuron connection strengths known as synaptic weights.


The true power and advantage of neural networks lie in their ability to represent both linear and non-linear relationships and to learn these relationships directly from the data being modeled. Traditional linear models are inadequate for modeling data that contains non-linear characteristics.

The most common neural network model is the multilayer perceptron (MLP). This type of neural network is known as a supervised network because it requires a desired output in order to learn. The goal of this type of network is to create a model that correctly maps the input to the output using historical data, so that the model can then be used to produce the output when the desired output is unknown.

A graphical representation of an MLP is shown below.

The MLP and many other neural networks learn using an algorithm called backpropagation. With backpropagation, the input data is repeatedly presented to the neural network. With each presentation, the output of the neural network is compared to the desired output and an error is computed. This error is then fed back (backpropagated) through the network and used to adjust the weights, so that the error decreases over the iterations and the model gets closer and closer to producing the desired output. This process is known as "training".
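The training loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the network used in this essay: the single hidden layer of four sigmoid units, the learning rate, the squared-error loss and the function name train_mlp are all assumptions chosen for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_mlp(X, y, hidden=4, lr=0.5, epochs=5000, seed=0):
    """Train a one-hidden-layer MLP with plain batch backpropagation.

    X: (n_samples, n_inputs) inputs, y: (n_samples, 1) desired outputs.
    Returns the learned parameters and the mean squared error per epoch.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0, (hidden, 1))
    b2 = np.zeros(1)
    errors = []
    for _ in range(epochs):
        # Forward pass: present the inputs and compute the network output.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # Compare with the desired output and record the error.
        err = out - y
        errors.append(float(np.mean(err ** 2)))
        # Backward pass: propagate the error through the sigmoid layers.
        d_out = err * out * (1.0 - out)
        d_h = (d_out @ W2.T) * h * (1.0 - h)
        # Adjust the weights in the direction that reduces the error.
        W2 -= lr * (h.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_h)
        b1 -= lr * d_h.sum(axis=0)
    return (W1, b1, W2, b2), errors
```

Calling train_mlp on the four XOR input patterns with targets 0, 1, 1, 0 is a common way to try this out, since XOR is a non-linear relationship that a purely linear model cannot represent. The recorded errors give a simple picture of how training progresses.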


Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs including:

Facial animation

Speech recognition

Detection and tracking of moving targets

Customer research

Data validation

Risk management


Speech recognition is one of the most exciting areas of modern computer science research, with the goal of enabling computers to understand speech and gesture. The sheer variety and complexity of spoken words makes distinguishing similar words very difficult. A neural network is a model of the way in which the human brain works. Such networks are well suited to many forms of pattern recognition and are able to learn from examples.

Neural networks are capable of incorporating multiple heterogeneous input features, which do not need to be treated as independent, and of finding the optimal combination of these features for classification. The purpose of this work is to exploit this capability of neural networks to improve speech recognition accuracy.

Neural networks are applied here to lip reading, which recognizes speech from visual information. It is one of the more recent techniques for speech recognition.


Lip reading involves the extraction of visual speech features. Most visual speech information is contained in the inner and outer lip contours. It has also been shown that the visibility of the teeth and tongue provides important speech cues. Particularly for fricatives, the place of articulation can often be determined visually, i.e. for labiodental (upper teeth on lower lip), interdental (tongue behind front teeth) and alveolar (tongue touching gum ridge) places of articulation. Other speech information might be contained in the protrusion and wrinkling of the lips.

Lip reading approaches can be classified into:

Image-based systems.

Model-based systems.

Image-based systems use grey-level information from an image region containing the lips, either directly or after some processing, as speech features. Most image information is therefore retained, but it is left to the recognition system to discriminate speech information from linguistic variability and illumination variability.

Model-based systems usually represent the lips by geometric measures, such as the height or width of the outer or inner lip boundary, or by a parametric contour model which represents the lip boundaries. The extracted features are of low dimension and invariant to illumination. Model-based systems depend on the definition of speech-related features by the user. The definition may therefore omit some speech-relevant information, and features such as the visibility of teeth and tongue are difficult to represent.

The earlier system performed well on a speaker-independent recognition task, but it did not use any intensity information, which might provide additional speech information. Here we extend this system by augmenting the feature vector with intensity information extracted from the mouth region. We evaluate the contribution of intensity information separately and in combination with shape features.


For modelling the shape variability of lips, we use an approach based on active shape models. These are statistically based deformable models which represent a contour by a set of points. Patterns of characteristic shape variability are learned from a training set, using principal component analysis (PCA). The main modes of shape variation captured in the training set can therefore be described by a small number of parameters.

The main advantage of this modelling technique is that heuristic assumptions about legal shape deformation are avoided. Instead, the model is only allowed to deform to shapes similar to the ones seen in the training set. Any shape x, a vector of the coordinates of the contour points, can be approximated by

x = x' + Pb

where x' is the mean shape, P is the matrix of eigenvectors of the covariance matrix and b is a vector containing the weights for each eigenvector. Only the first few eigenvectors, corresponding to the largest eigenvalues, are needed to describe the main shape variability.
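As a rough sketch of how such a shape model could be built, the NumPy code below estimates the mean shape and the leading eigenvectors from a set of training shapes and converts between a shape x and its weights b. It assumes the training shapes are already aligned and stored one per row as flattened coordinate vectors. The function names are hypothetical and are not taken from the system described here.

```python
import numpy as np


def fit_shape_model(shapes, n_modes):
    """Fit a PCA shape model.

    shapes: (n_samples, 2 * n_points) array of aligned contour coordinates.
    Returns the mean shape, the matrix P whose columns are the n_modes
    leading eigenvectors, and the corresponding eigenvalues.
    """
    mean = shapes.mean(axis=0)
    cov = np.cov(shapes, rowvar=False)       # rows are observations
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_modes]
    return mean, eigvecs[:, order], eigvals[order]


def shape_to_weights(x, mean, P):
    """Project a shape onto the model: b = P^T (x - mean)."""
    return P.T @ (x - mean)


def weights_to_shape(b, mean, P):
    """Approximate a shape from its weights: x = mean + P b."""
    return mean + P @ b
```

Keeping only a few modes makes weights_to_shape an approximation. The weight vector b returned by shape_to_weights is the kind of low-dimensional description that can serve as a feature vector for recognition.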

Figure 1: Shape model for the inner and outer lip contour with profile vectors perpendicular to the lip contours.

Lip model with mean shape and mean intensity

We built and tested two models of the lips: Model 1, which represents the outer lip boundary only and Model 2, which represents the outer and inner lip boundary. The models are used to locate, track and parameterise lip movements in image sequences. The weights for the shape modes are recovered from the tracking results and serve as features for the recognition system.


Several approaches to speech reading based on intensity information have been developed. Our approach for extracting intensity information is based on principal component analysis and is related to the "eigenlips" approach. That approach placed a window around the mouth area, on which PCA was performed. Since the window does not deform with the lips, the eigenvectors of the PCA mainly account for intensity variation due to different lip shapes and mouth openings. We already obtain detailed information about the lip shape from our shape model through a small number of parameters, and are therefore mainly interested in intensity information that is independent of lip shape.

We follow an approach in which a one-dimensional grey-level profile is sampled perpendicular to the contour at each model point, as shown in Figure 1. Instead of using local grey-level models, however, we construct a global grey-level model by concatenating the profile vectors of all model points to form a global intensity vector h. We then estimate the covariance matrix of the global profile vectors over the training set and perform PCA to obtain the principal modes of profile variation. Any profile h can now be approximated by

h = h' + Pg bg

where h' is the mean profile, Pg is the matrix whose columns are the first few eigenvectors, corresponding to the largest eigenvalues, and bg is a vector containing the weights for each eigenvector.
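The sampling step can be illustrated with the following sketch, which takes a closed contour, estimates a unit normal at each model point, samples the image along that normal and concatenates the results into one global vector h. The central-difference normals, the nearest-pixel sampling, the profile length and the function names are assumptions made for the example. The text does not specify these details.

```python
import numpy as np


def contour_normals(points):
    """Unit normals of a closed contour given as an (n, 2) array of (x, y)."""
    # Central-difference tangent at each point, wrapping around the contour.
    tangents = np.roll(points, -1, axis=0) - np.roll(points, 1, axis=0)
    # Rotate each tangent by 90 degrees to obtain a normal direction.
    normals = np.stack([tangents[:, 1], -tangents[:, 0]], axis=1)
    return normals / np.linalg.norm(normals, axis=1, keepdims=True)


def global_profile(image, points, half_len=3):
    """Sample a 1-D grey-level profile along the normal at every model point
    and concatenate all profiles into one global intensity vector h."""
    normals = contour_normals(points)
    offsets = np.arange(-half_len, half_len + 1)
    # Sample positions with shape (n_points, profile_len, 2).
    pos = points[:, None, :] + offsets[None, :, None] * normals[:, None, :]
    # Nearest-pixel lookup, clipped to the image borders (x -> column, y -> row).
    cols = np.clip(np.rint(pos[..., 0]).astype(int), 0, image.shape[1] - 1)
    rows = np.clip(np.rint(pos[..., 1]).astype(int), 0, image.shape[0] - 1)
    return image[rows, cols].astype(float).ravel()
```

Because the sample positions are derived from the contour points, the profiles move and deform together with the shape model. Stacking the vectors h from all training frames gives the data matrix on which PCA is performed.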

Example images of a person saying the word “three” with tracking results



The profile model was initially designed and tailored to enable robust tracking of the lips rather than to extract speech information from the profile vectors. The profile model is used to describe the fit between the image and the model. During image search, the model is aligned to the image as closely as possible by calculating the optimal weights for the first few eigenvectors. The mean square error (MSE) between the aligned profile and the image is used as the cost, and a minimization algorithm deforms the shape model to find the minimum cost. The profile weight vector for aligning the model is found using

bg = Pg^T (h - h')

and the cost E is obtained using

E = (h - h')^T (h - h') - bg^T bg

The profile vectors deform with the shape model and therefore always represent the same object features. The weight vector bg provides information about the principal modes needed to align to the image. We recover the weights from the tracking results and use them as speech features.
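The alignment weights and the fit cost can be sketched as follows. The sketch assumes that the columns of Pg are orthonormal eigenvectors, in which case the least-squares weights are Pg^T (h - h') and the squared residual between the aligned profile and the image equals (h - h')^T (h - h') - bg^T bg. The function names are hypothetical, and any normalisation of the cost used in the actual system is not known.

```python
import numpy as np


def profile_weights(h, h_mean, Pg):
    """Least-squares weights that align the profile model to profile h,
    assuming the columns of Pg are orthonormal."""
    return Pg.T @ (h - h_mean)


def profile_cost(h, h_mean, Pg):
    """Squared error left after the first few modes have been fitted."""
    d = h - h_mean
    bg = profile_weights(h, h_mean, Pg)
    return float(d @ d - bg @ bg)
```

A tracker would evaluate profile_cost for candidate shapes and keep the one with the lowest value. The vector returned by profile_weights at that minimum then supplies the intensity features for that frame.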


The weights for the shape model and the intensity model are extracted at each image frame to form frame-dependent feature vectors for the recognition system. We use the shape parameters, the intensity parameters, or both parameter sets as the feature vector. Assuming accurate tracking performance, the shape and intensity parameters are invariant to translation, rotation and scale. The intensity modes account both for illumination differences and for differences due to the visibility of teeth and tongue and to protrusion.

Dynamic speech information is important and often less sensitive to inter-speaker variability. For example, the intensity values of the lips remain fairly constant during speech, while the intensity values of the mouth opening change. The intensity values of the lips vary between speakers, but the temporal changes of intensity might be similar for different speakers. Dynamic features should therefore be more robust to different illumination and different speakers.
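One simple way to form such feature vectors is shown below: the per-frame shape and intensity weights are concatenated, and first-order temporal differences are appended as dynamic features. The text does not state how the dynamic features are computed, so the frame-to-frame difference used here is an assumption, and the function names are hypothetical.

```python
import numpy as np


def frame_features(shape_w, intensity_w):
    """Concatenate per-frame shape and intensity weights.

    shape_w: (n_frames, n_shape_modes), intensity_w: (n_frames, n_intensity_modes).
    """
    return np.hstack([shape_w, intensity_w])


def add_deltas(features):
    """Append first-order temporal differences to each frame's features.

    The first frame has no predecessor, so its difference is set to zero.
    """
    deltas = np.diff(features, axis=0, prepend=features[:1])
    return np.hstack([features, deltas])
```

A constant offset in the static features, such as a speaker-specific brightness level, cancels out in the differences. This matches the argument above that temporal changes are less speaker-dependent than absolute values.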

Recognition accuracy for model-1

Recognition accuracy for model-2


The world of computing has a lot to gain from neural networks. Their ability to learn by example makes them very flexible and powerful. Neural networks also contribute to other areas of research such as neurology and psychology.

We have described a lip reading system that uses both shape and intensity information. An important property of the intensity model is that it deforms with the lip contour model in order to represent the same object features after lip movements. Recognition tests using only intensity parameters indicate that much visual speech information is contained in the grey-level information, which might account for protrusion or the visibility of teeth and tongue. Recognition performance was slightly higher for intensity features than for shape features, and their combined use outperformed both feature sets.

This application of neural networks to lip reading is still under research and is expected to yield useful results. Its potential use by people with hearing impairments adds to its importance.