A tutorial on the design and development of automatic speaker recognition systems is presented. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. These systems can operate in two modes: one is to identify a particular person, and the other is to verify a person's claimed identity. Speech processing and the basic components of automatic speaker recognition systems are described, design tradeoffs are discussed, and the performance of various systems is compared.
Access control, authentication, biometrics, biomedical signal processing, biomedical transducers, communication system security, computer network security, computer security, databases, identification of persons, speaker recognition, speech recognition, verification.
The focus of this chapter is on facilities and network access-control applications of speaker recognition. Automatic speech recognition means many things to many people. At one end of the spectrum is the voice-operated alarm clock that ceases ringing when the word 'stop' is shouted at it; at the other are the automatic dictating machine, which produces a typed manuscript in response to the human voice, and the expert system, which provides answers to spoken questions. Practical speech recognizers fall somewhere between these extremes.
Speaker recognition encompasses verification and identification. Automatic speaker verification (ASV) is the use of a machine to verify a person's claimed identity from his voice.
The literature abounds with different terms for speaker verification, including voice verification, speaker authentication, voice authentication, and talker verification. Present-day speech recognizers operate with modest vocabularies of ten to a few hundred words. The words must usually be spoken in isolation in a fairly low-noise environment, though some recognizers are able to cope with connected words such as strings of digits. Such systems seem very limited, but they are beginning to find many applications. Techniques for dealing with continuous speech are being investigated, and research is being performed to study the multi-user problem.
General overviews of speaker recognition have been given by Doddington, Rosenberg, and Jack.
Speech production and perception :
It is useful to be able to describe spoken utterances by means of discrete symbols representing the sounds that have been produced. The letter symbols of writing are obviously unsuitable, as they are used to represent different sounds in different contexts. The letter 'O', for example, is pronounced differently in the word 'One' and in the word 'Bone'.
Phonemes are used as descriptive tools by phoneticians. The phoneme is a linguistic unit defined such that if one phoneme is substituted for another in a word, the meaning of that word may be changed. The set of phonemes used in each language may be different. For example, in English /l/ and /r/ are two distinct phonemes. A consistent set of phonetic symbols for languages is provided by the International Phonetic Alphabet (IPA).
Another way of classifying speech sounds is in terms of the way in which they are produced. The units are then known as 'phones'; these are usually represented by symbols enclosed in square brackets to distinguish them from phonemes, which are represented by symbols enclosed in oblique lines.
Speech processing :
Speaker verification is defined as deciding if a speaker is who he claims to be. This is different from the speaker identification problem, which is deciding if a speaker is a specific person or is among a group of persons. In speaker verification, a person makes an identity claim (e.g., entering an employee number or presenting his smart card). In text-dependent recognition, the phrase is known to the system; it can be fixed, or it can vary and be prompted (visually or orally). The claimant speaks the phrase into a microphone. This signal is analyzed by a verification system that makes the binary decision to accept or reject the user's identity claim, or possibly to report insufficient confidence and request additional input before making the decision.
A typical ASV setup is shown in Figure 8.2. The claimant, who has previously
enrolled in the system, presents an encrypted smart card containing his identification information. He then attempts to be authenticated by speaking a prompted phrase(s) into the microphone. There is generally a tradeoff between recognition accuracy and the test-session duration of speech. In addition to his voice, ambient room noise and delayed versions of his voice enter the microphone via reflective acoustic surfaces. Prior to a verification session, users must enroll in the system (typically under supervised conditions). During this enrollment, voice models are generated and stored (possibly on a smart card) for use in later verification sessions. There is also generally a tradeoff between recognition accuracy and the enrollment-session duration of speech and the number of enrollment sessions.
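The enroll-then-verify flow described above can be summarized in code. The sketch below is a hypothetical illustration, not any real system: the trivial amplitude-based front end, the function names, and the 0.5 threshold are all assumptions chosen only to make the two-phase structure concrete.

```python
# Hypothetical sketch of the enrollment / verification flow.
# The feature extractor and threshold are illustrative assumptions.

def extract_features(samples):
    # Placeholder front end: per-frame mean absolute amplitude.
    frame = 80  # e.g., 10 ms at 8 kHz
    return [sum(abs(s) for s in samples[i:i + frame]) / frame
            for i in range(0, len(samples) - frame + 1, frame)]

def enroll(samples):
    # Enrollment: build and store a voice model (here, a global average).
    feats = extract_features(samples)
    return sum(feats) / len(feats)

def verify(samples, model, threshold=0.5):
    # Verification: accept only if the test utterance is close to the model.
    feats = extract_features(samples)
    score = abs(sum(feats) / len(feats) - model)
    return score < threshold
```

A real system would replace the toy front end with spectral features and the scalar model with the template or stochastic models discussed later in the chapter.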
Many factors can contribute to verification and identification errors; among them are human and environmental factors such as misreading or misspeaking a prompt and changes in the acoustic environment. These factors are generally outside the scope of algorithms or are better corrected by means other than algorithms (e.g., better microphones). However, these factors are important because, no matter how good a speaker recognition algorithm is, human error (e.g., misreading or misspeaking) ultimately limits its performance.
(Figure: taxonomy of speech processing for high-quality speech, distinguishing speaker identification from speaker verification.)
Problem formulation :
Speech is a complex signal produced as the result of several transformations, and automatic speech recognition requires an understanding of the problems involved. The first attempts to build machines that could recognize speech were made about 50 years ago. When one telephone subscriber wished to call another, the caller spoke to the operator at the exchange and gave the number of the person he wished to contact.
Engineers realized that if a machine could be built to recognize spoken digits, the operator could be dispensed with and a more efficient, less expensive system could be introduced. However, the means of building such machines changed very little until relatively recently. The sound spectrograph showed that different words give rise to different acoustic patterns, suggesting that all the information required for speech recognition resides in the acoustic signal.
(Figure: analysis filter bank with a 1000 Hz high-pass filter and an 800 Hz low-pass filter.)
Linear Prediction :
The all-pole LP model represents a signal s_n by a linear combination of its past values and a scaled present input:

    s_n = G u_n + sum_{k=1..p} a_k s_{n-k}

where s_n is the present output, p is the prediction order, a_k are the model parameters called the predictor coefficients (PCs), s_{n-k} are past outputs, G is a gain scaling factor, and u_n is the present input. In speech applications, the input u_n is generally unknown, so it is ignored. This leaves the LP approximation

    ŝ_n = sum_{k=1..p} a_k s_{n-k},

which depends only on past output samples. The source u_n, which corresponds to the human vocal tract excitation, is not modeled by these PCs. It is certainly reasonable to expect that some speaker-dependent characteristics are present in this excitation signal (e.g., fundamental frequency). Therefore, if the excitation signal is ignored, valuable speaker-verification discrimination information could be lost. Defining the prediction error e_n (also known as the residual) as the difference between the actual value s_n and the predicted value ŝ_n yields

    e_n = s_n - ŝ_n = s_n - sum_{k=1..p} a_k s_{n-k}.
Power of z:            0      -1      -2      -3      -4      -5      -6      -7      -8
Predictor coefficient: 1  -2.346   1.657  -0.006   0.323  -1.482   1.155  -0.190  -0.059
LP analysis determines the PCs of the inverse filter A(z) that minimize the prediction error e_n in some sense. Typically, the mean square error (MSE) is minimized because it allows a simple, closed-form solution for the PCs. For example, an 8th-order LP analysis of the vowel /U/ (as in "foot"), sampled at 8 kHz, yields the predictor coefficients shown in the table above.
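The MSE-optimal PCs can be computed from the frame's autocorrelation sequence with the Levinson-Durbin recursion. The sketch below is an illustrative pure-Python implementation, not the chapter's own code; the function names and the toy decaying-exponential test signal are assumptions.

```python
def autocorr(x, p):
    # Autocorrelation lags r[0..p] of one analysis frame x.
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(p + 1)]

def levinson_durbin(r):
    # Solve the normal equations for the predictor coefficients a[1..p]
    # that minimize the mean square prediction error e_n.
    p = len(r) - 1
    a = [0.0] * (p + 1)
    e = r[0]
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / e
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= (1.0 - k * k)  # residual energy shrinks at each order
    return a[1:], e  # predictor coefficients and final residual energy
```

For a first-order signal x_n = 0.9 x_{n-1}, a 1st-order analysis recovers a_1 close to 0.9, as expected from the model equation.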
Speaker Recognition :
Evaluating the magnitude of the z transform of H(z) at equally spaced intervals on the unit circle yields the following power spectrum, having formants (vocal tract resonances or spectral peaks) at 390, 870, and 3040 Hz (Figure 8.5). These resonance frequencies are in agreement with the Peterson and Barney formant frequency data for the vowel /U/.
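Evaluating |H(z)| = |G/A(z)| on the unit circle can be sketched directly, since A(e^jw) is just a short complex sum over the inverse-filter taps. This is an illustrative sketch under the convention that the first tap of A(z) is 1 (as in the table above); the function names are assumptions.

```python
import math

def lp_power_spectrum(a, n_points=256, fs=8000.0):
    # |H(f)|^2 = 1 / |A(e^{j 2*pi*f/fs})|^2, evaluated at n_points equally
    # spaced frequencies on the upper half of the unit circle.
    # a = [1, a1, ..., ap] are the taps of the inverse filter A(z).
    freqs, power = [], []
    for i in range(n_points):
        w = math.pi * i / n_points
        re = sum(c * math.cos(k * w) for k, c in enumerate(a))
        im = -sum(c * math.sin(k * w) for k, c in enumerate(a))
        freqs.append(w * fs / (2.0 * math.pi))
        power.append(1.0 / (re * re + im * im))
    return freqs, power

def peaks(power):
    # Indices of local maxima: the candidate formant locations.
    return [i for i in range(1, len(power) - 1)
            if power[i - 1] < power[i] > power[i + 1]]
```

Running this on the /U/ coefficients from the table should reproduce the reported three-formant structure; a synthetic single-resonator A(z) = 1 - 2r·cos(w0)·z^-1 + r^2·z^-2 gives a single peak near w0, which makes the behavior easy to check.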
Features are constructed from the speech model parameters; for example, the a_k shown in Eq. (8.6). These LP coefficients are typically nonlinearly transformed into perceptually meaningful domains suited to the application. Some feature domains useful for speech coding and recognition include reflection coefficients (RCs); log-area ratios (LARs) or arcsin of the RCs; line spectrum pair (LSP) frequencies [4,6,21,22,41]; and the LP cepstrum.
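Of these transforms, the LP cepstrum has a particularly simple recursion from the predictor coefficients. The sketch below shows the standard recursion; the function name is an assumption for illustration.

```python
def lpc_to_cepstrum(a, n_ceps):
    # Convert predictor coefficients a[1..p] (from s_n ~ sum a_k s_{n-k})
    # to LP cepstral coefficients via the standard recursion:
    #   c_n = a_n + (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}
    # with a_n taken as 0 for n > p.
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```

For a single pole a_1 = 0.5, the cepstrum should follow the closed form c_n = 0.5^n / n, which is a convenient sanity check on the recursion.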
Reflection Coefficients and Log Area Ratios :
The vocal tract can be modeled as an electrical transmission line, a waveguide, or an analogous series of cylindrical acoustic tubes. At each junction, there can be an impedance mismatch or an analogous difference in cross-sectional areas between tubes. At each boundary, a portion of the wave is transmitted and the remainder is reflected (assuming lossless tubes). The reflection coefficients ki are the percentage of the reflection at these discontinuities. If the acoustic tubes are of equal length, the time required for sound to propagate through each tube is equal (assuming planar wave propagation). Equal propagation times allow simple z transformation for digital filter simulation. For example, a series of five acoustic tubes of equal lengths with
cross-sectional areas A1, …, A5 is shown in Figure 8.6. This series of five tubes represents a fourth-order system that might fit a vocal tract minus the nasal cavity. The reflection coefficients are determined by the ratios of the adjacent cross-sectional areas with appropriate boundary conditions. For a pth-order system, the boundary conditions given in Eq. (8.7) correspond to a closed glottis (zero area) and a large area following the lips.
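The area-ratio relation is short enough to sketch directly. One common convention (sign conventions differ between texts, so treat this as an assumption) defines the reflection coefficient at each junction from the adjacent areas:

```python
def reflection_coefficients(areas):
    # k_i at each junction between adjacent lossless tube sections,
    # using the convention k_i = (A_{i+1} - A_i) / (A_{i+1} + A_i).
    # Equal adjacent areas give k_i = 0 (no reflection); all |k_i| < 1.
    return [(areas[i + 1] - areas[i]) / (areas[i + 1] + areas[i])
            for i in range(len(areas) - 1)]
```

The log-area ratios mentioned above are then simply log(A_{i+1}/A_i), an equivalent parameterization with better quantization properties.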
Feature selection and measures :
To apply mathematical tools without loss of generality, the speech signal can be represented by a sequence of feature vectors. The selection of appropriate features and methods to estimate (extract or measure) them are known as feature selection and feature extraction, respectively.
Traditionally, pattern-recognition paradigms are divided into three components: feature extraction and selection, pattern matching, and classification. Although this division is convenient from the perspective of designing system components, these components are not independent. The false demarcation among these components can lead to suboptimal designs because they all interact in real-world systems. In speaker verification, the goal is to design a system that minimizes the probability of verification errors. Thus, the underlying objective is to discriminate between the given speaker and all others. A comprehensive review of discriminant
analysis and overviews of the feature selection and extraction methods can be found in the literature. The next section introduces pattern matching.
Pattern matching :
The pattern-matching task of speaker verification involves computing a match score, which is a measure of the similarity between the input feature vectors and some model. Speaker models are constructed from the features extracted from the speech signal. To enroll users into the system, a model of the voice, based on the extracted features, is generated and stored (possibly on an encrypted smart card). Then, to authenticate a user, the matching algorithm compares and scores the incoming speech signal against the model of the claimed user.
There are two types of models: stochastic models and template models. In stochastic models, the pattern matching is probabilistic and results in a measure of the likelihood, or conditional probability, of the observation given the model. For template models, the pattern matching is deterministic. The observation is assumed to be an imperfect replica of the template, and the alignment of observed frames to template frames is selected to minimize a distance measure d. The likelihood L can be approximated in template-based models by exponentiating the utterance match scores
L = exp(-ad)
where a is a positive constant (equivalently, the scores are assumed to be proportional to log likelihoods). Likelihood ratios can then be formed using global speaker models or cohorts to normalize L.
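The mapping L = exp(-ad) and the cohort normalization it feeds can be sketched in a few lines. This is an illustrative sketch of the scoring arithmetic only; the function names and the choice of averaging the cohort likelihoods are assumptions.

```python
import math

def likelihood(d, a=1.0):
    # L = exp(-a * d): map a template match distance d to an
    # approximate likelihood (a is a positive constant).
    return math.exp(-a * d)

def likelihood_ratio(d_claimant, d_cohorts, a=1.0):
    # Normalize the claimant's likelihood by the average likelihood
    # over a set of cohort-speaker distances.
    num = likelihood(d_claimant, a)
    den = sum(likelihood(d, a) for d in d_cohorts) / len(d_cohorts)
    return num / den
```

A ratio above 1 means the input matched the claimant's model better than the cohort models, supporting acceptance; below 1 supports rejection.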
The template model and its corresponding distance measure is perhaps the most intuitive method. The template method can be dependent or independent of time. An example of a time-independent template model is VQ modeling . All temporal variation is ignored in this model and global averages (e.g., centroids) are all that is used. A time-dependent template model is more complicated because it must accommodate human speaking rate variability.
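Time-independent VQ modeling can be illustrated with a minimal sketch. Real VQ speaker models are usually trained with the LBG algorithm on vector features; the plain k-means below on scalar "frames" is a simplifying assumption made only to show the cluster-then-score structure.

```python
def vq_train(frames, k=2, iters=10):
    # Build a k-entry codebook of global averages (centroids) with
    # plain k-means; temporal order of the frames is ignored.
    cb = frames[:k]  # naive initialization from the first k frames
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for f in frames:
            j = min(range(k), key=lambda i: (f - cb[i]) ** 2)
            buckets[j].append(f)
        cb = [sum(b) / len(b) if b else cb[i] for i, b in enumerate(buckets)]
    return cb

def vq_distortion(frames, cb):
    # Match score: average distance of each test frame to its
    # nearest codeword (smaller = better match).
    return sum(min(abs(f - c) for c in cb) for f in frames) / len(frames)
```

Enrollment data yields low distortion against its own codebook, while frames far from all centroids score poorly, which is the basis of the accept/reject comparison.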
Nearest Neighbors :
A technique combining the strengths of the DTW and VQ methods is called nearest neighbors (NN) [17,20]. Unlike the VQ method, the NN method does not cluster the enrollment training data to form a compact code book. Instead, it keeps all the training data and can, therefore, use temporal information.
The claimant's inter frame distance matrix is computed by measuring the distance between test-session frames (the input) and the claimant's stored enrollment-session frames. The NN distance is the minimum distance between a test-session frame and the enrollment frames. The NN distances for all the test session frames are then averaged to form a match score. Similarly, as shown in the rear planes of Figure 8.8, the test-session frames are also measured against a set of
stored reference "cohort" speakers to form match scores. The match scores are then
combined to form a likelihood ratio approximation .
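The NN scoring just described reduces to two small computations: a minimum over enrollment frames, averaged over test frames, then a comparison against cohort scores. The sketch below is illustrative; scalar frames and the subtract-the-best-cohort combination are simplifying assumptions.

```python
def nn_score(test_frames, enroll_frames):
    # Average, over test-session frames, of the distance to the
    # nearest enrollment-session frame (all training data is kept;
    # no codebook clustering as in VQ).
    return sum(min(abs(t - e) for e in enroll_frames)
               for t in test_frames) / len(test_frames)

def nn_likelihood_ratio_score(test_frames, claimant, cohorts):
    # Combine claimant and cohort match scores; subtracting the best
    # cohort score approximates a log likelihood ratio.
    # Smaller (more negative) values favor accepting the claimant.
    return nn_score(test_frames, claimant) - min(
        nn_score(test_frames, c) for c in cohorts)
```

Keeping every enrollment frame is what makes the method both powerful and memory- and compute-intensive, as noted below.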
The NN method is one of the most memory- and compute-intensive speaker-verification algorithms. It is also one of the most powerful methods.
Classification and Decision Theory :
Having computed a match score between the input speech-feature vector and a model of the claimed speaker's voice, a verification decision is made whether to accept or reject the speaker or request another utterance (or, without a claimed identity, an identification decision is made). If a verification system accepts an impostor, it makes a false acceptance (FA) error. If the system rejects a valid user, it makes a false rejection (FR) error. The FA and FR errors can be traded off by adjusting the decision threshold, as shown by a Receiver Operating Characteristic (ROC) curve. The operating point where the FA and FR are equal corresponds to the equal error rate.
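The FA/FR tradeoff and the equal error rate can be computed directly from genuine and impostor score lists. This sketch sweeps the observed scores as candidate thresholds (an assumption; a real evaluation might interpolate); higher scores are taken to mean greater similarity.

```python
def error_rates(genuine, impostor, threshold):
    # Accept when score >= threshold.
    # FR: valid users rejected; FA: impostors accepted.
    fr = sum(1 for s in genuine if s < threshold) / len(genuine)
    fa = sum(1 for s in impostor if s >= threshold) / len(impostor)
    return fa, fr

def equal_error_rate(genuine, impostor):
    # Sweep candidate thresholds along the ROC; return the (FA, FR)
    # pair where the two error rates are closest to equal.
    best = None
    for t in sorted(set(genuine + impostor)):
        fa, fr = error_rates(genuine, impostor, t)
        if best is None or abs(fa - fr) < abs(best[0] - best[1]):
            best = (fa, fr)
    return best
```

Raising the threshold trades FA errors for FR errors and vice versa; the EER is simply the operating point where the curve crosses FA = FR.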
The decision process can have accept, continue, time-out, or reject outcomes. In this case, the decision-making, or classification, procedure is a sequential hypothesis-testing problem. A brief overview of the decision theory involved is available in the literature.
Using the YOHO prerecorded speaker-verification database, the following results on wolves and sheep were measured. The impostor testing was simulated by randomly selecting a valid user (a potential wolf) and altering his/her identity claim to match that of a randomly selected target user (a potential sheep). Because the potential wolf is not intentionally attempting to masquerade as the potential sheep, this is referred to as the "casual impostor" paradigm. Testing the system to a certain confidence level implies a minimum requirement for the number of trials. In this testing, there were 9,300 simulated impostor trials to test to the desired confidence [5,17].
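The link between confidence level and number of trials can be sketched with the usual normal approximation to the binomial. This is an illustrative calculation, not the test protocol actually used for YOHO; the function names and the z = 1.28 default (roughly an 80% two-sided interval) are assumptions.

```python
import math

def error_rate_half_width(p, n, z=1.28):
    # Normal-approximation confidence-interval half-width around an
    # observed error rate p after n independent trials.
    return z * math.sqrt(p * (1.0 - p) / n)

def trials_needed(p, rel_width, z=1.28):
    # Minimum trials so the half-width is at most rel_width * p,
    # i.e., n >= z^2 * (1 - p) / (rel_width^2 * p).
    return math.ceil((z * z * (1.0 - p)) / (rel_width * rel_width * p))
```

For example, confirming an error rate near 0.5% to within ±50% of its value at this confidence requires on the order of a thousand trials, which is why thousands of simulated impostor trials are needed.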
The DTW ASV system tested here was created by Higgins, et al. This system is a variation on a DTW approach that introduced likelihood ratio scoring via cohort normalization, in which the input utterance is compared with the claimant's voice model and with an alternate model composed of models of other users with similar voices.
The DTW system made 19 FA errors over the 9,300 impostor trials. These 19 pairs of wolves and sheep have interesting characteristics. The database contains four times as many males as it does females, but the 18:1 ratio of male wolves to female wolves is disproportionate. It is also interesting to note that one male wolf successfully preyed upon three different female sheep. The YOHO corpus thus provides at least 19 pairs of wolves and sheep under the DTW ASV system for further investigation.
Wolves and Sheep :
FA errors due to individual wolves and sheep are shown in the 3-D histogram plots of the individual speakers who were falsely accepted as other speakers by the DTW system. For example, the person with an identification number of 97328 is never a wolf and is a sheep once under the DTW system.
The DTW system rarely has the same speaker as both a wolf and a sheep (there are only two exceptions in this data). These exceptions, called wolf-sheep, probably have poor models because they match a sheep's model more closely than their own and a wolf's model also matches their model more closely than their own. These wolf-sheep would likely benefit from retraining to improve their models. Now let us look at the NN system. Figure 8.12 shows the FA errors committed by the NN system. Two speakers, who are sheep, are seen to dominate the NN system's FA errors. A dramatic performance improvement would result if these two speakers
were recognized correctly by the system.
Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. Speaker-recognition systems can be used to identify a particular person or to verify a person's claimed identity. Speech processing, speech production, and features and pattern matching for speaker recognition were introduced. Recognition accuracy was shown by coarse-grain ROC curves, and fine-grain histograms revealed the wolves and sheep of two example systems. Speaker recognition systems can achieve 0.5% equal error rates at the 80% confidence level in the benign real-world conditions considered here.