Speaker recognition and speech recognition are closely related. Whereas speech recognition determines what was said, speaker recognition automatically determines who is speaking from the unique characteristics of the speaker's voice. Text-independent speaker recognition is a popular research area, with various methods available in the literature. This paper presents a simple approach to text-dependent speaker identification. The method uses Symlet wavelets for feature extraction. The extracted features are then classified using two data mining algorithms, J48 and Naïve Bayes.
Keywords: Speaker Recognition, Text dependent, Symlet Wavelet, Naïve Bayes, J48
The advent of digital computers in the 1950s spurred modern speech recognition. Together with tools for capturing and analysing speech, such as analog-to-digital converters and sound spectrograms, computers enabled researchers to find methods of extracting features from speech that allow discrimination between words. Automatic segmentation of speech into linguistically relevant units (such as phonemes, syllables and words) advanced, as did new pattern-matching and classification algorithms. These techniques have improved to the point where commercial systems with very high recognition rates are available at nominal prices.
At present, speech recognition is used in manufacturing for voice data entry or commands when the operator's hands are occupied. It is also applied in medicine, where voice input accelerates routine report writing. Speech recognition lets users control personal workstations, or interact remotely with applications when no touch-tone keypad is available. Speaker identification enables non-intrusive, high-accuracy monitoring that meets security requirements. It also gives greater freedom to the physically challenged.
Speaker recognition and speech recognition are closely related. Whereas speech recognition determines what was said, speaker recognition automatically determines who is speaking from the unique characteristics of the speaker's voice. Deciding whether a particular claimed speaker produced an utterance is verification; determining a person's identity from a set of known speakers is identification. The most general form of speaker recognition (text-independent) is not very accurate for large speaker populations, but if the spoken words are constrained (text-dependent) and speech quality does not vary much, then accurate recognition is possible on a workstation.
To identify who is talking, the speech signal must be processed to extract measures of speaker variability rather than being analysed in segments corresponding to phonemes or pieces of text. Speaker recognition makes only one classification, based on the input test utterance. Although studies show that certain acoustic features predict speaker identity better than others, few recognizers examine specific sounds because of the difficulty of phone segmentation and identification.
Both automatic speaker verification and identification use a stored database of reference patterns (templates) for N known speakers, and both use similar analysis and decision techniques. Verification is easier because it compares the test pattern against a single reference pattern and makes a binary decision: is there a good enough match against the claimed speaker's template? The error rate for speaker identification is higher because the system must select which of the N known voices best matches the test voice, or return "no match" if the test voice differs sufficiently from all reference templates.
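The difference between the two decisions can be illustrated with a minimal sketch. The Euclidean distance, the threshold value, and the speaker names and template values are all assumptions made for illustration; the paper does not specify a matching rule.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def verify(test, claimed_template, threshold):
    """Verification: one comparison and a binary accept/reject decision."""
    return distance(test, claimed_template) <= threshold

def identify(test, templates, threshold):
    """Identification: pick the closest of N templates, or None for "no match"."""
    best_name = min(templates, key=lambda name: distance(test, templates[name]))
    if distance(test, templates[best_name]) <= threshold:
        return best_name
    return None

# Hypothetical reference templates for two known speakers.
templates = {"speaker_a": [1.0, 2.0], "speaker_b": [4.0, 6.0]}

accepted = verify([1.1, 2.0], templates["speaker_a"], threshold=0.5)
who = identify([3.9, 6.1], templates, threshold=0.5)
unknown = identify([10.0, 10.0], templates, threshold=0.5)
```

Verification needs a single comparison, whereas identification needs N comparisons plus the rejection test, which is one reason its error rate tends to be higher.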
Comparing test and reference utterances for speaker identity is easier when the underlying texts are identical, as in text-dependent speaker recognition. With cooperative speakers, speaker recognition can be applied directly by using the same words to train and test the system. This is possible in verification, whereas speaker identification usually needs text-independent methods. The higher error rates of text-independent methods mean that more speech data is required for training and testing. Automatic speaker recognition by computer has been an active research area since the early 1960s, when the use of spectrograms for personal identification was introduced.
Text-independent speaker recognition is a popular research area, especially for applications such as forensic science, intelligence gathering and passive surveillance of voice circuits. Free-text recognition cannot control the conditions that influence system performance, including speech signal variability, distortions and communication channel noise. Recognition faces multiple problems, including unconstrained input speech, uncooperative speakers and uncontrolled environmental parameters, which make it necessary to focus on the features that are unique to an individual's speech.
Various approaches to speaker identification are available in the literature, based on the Gaussian mixture model (GMM), kernel methods such as the support vector machine (SVM) [5, 6], and non-negative matrix factorization. In this paper, speaker recognition based on wavelet feature extraction is investigated. The method uses Symlet wavelets for feature extraction. The extracted features are then classified using two data mining algorithms, J48 and Naïve Bayes. The rest of the paper is organized as follows: section 2 presents related work from the literature, section 3 describes the materials and methods used in this investigation, section 4 gives the experimental details and section 5 concludes the paper.
2. Related works
Kekre et al presented a simple text-dependent speaker identification approach combining spectrograms and the Discrete Cosine Transform (DCT). The approach uses the DCT to find similarities between the spectrograms of speech samples. The database for the experiments consists of the set of spectrograms rather than raw speech samples. Performance is compared for different numbers of DCT coefficients in three cases: DCT applied to the entire spectrogram, DCT applied to the spectrogram divided into blocks, and DCT applied to the Row Mean of the spectrogram. The results showed that DCT on the Row Mean of the spectrogram requires drastically less computation than the other two methods, with an almost equal identification rate.
Shafik et al presented a robust procedure for speaker identification from degraded speech signals, based on Mel-frequency cepstral coefficients (MFCCs) extracted from both the degraded speech signals and their wavelet transforms. Speaker identification based on MFCCs alone is known not to be robust in the presence of noise and telephone degradation. Extracting features from the wavelet transform of the degraded signal adds speech features from the approximation and detail components of the signal, which helps achieve higher identification rates. The proposed method uses neural networks for feature matching. A comparison with traditional MFCC-based feature extraction on noisy and telephone-degraded speech, with additive white Gaussian noise (AWGN) and colored noise, shows that the proposed method achieves better recognition rates across the degradation cases tested.
Li et al presented an auditory-based feature extraction algorithm in which the feature is built on a recently published time-frequency transform and a set of modules that simulate signal processing in the cochlea. The feature is applied to speaker identification to address acoustic mismatch between training and testing. The performance of acoustic models trained on clean speech usually drops when they are tested on noisy speech. The proposed feature is strongly robust in such mismatched situations. In the experiments, both MFCC and the proposed feature perform near perfectly in clean testing conditions, but when the SNR of the input signal drops to 6 dB, the average accuracy of the MFCC feature is only 41.2%, whereas the proposed feature still achieves an average accuracy of 88.3%.
Yamada et al proposed a novel semi-supervised speaker identification method to alleviate the influence of non-stationarity such as session-dependent variation, changes in the recording environment, and physical condition or emotion. Variation in voice quality is assumed to follow the covariate shift model, in which only the distribution of voice features changes between the training and test phases. The proposed method consists of weighted versions of kernel logistic regression and cross validation and can, in theory, mitigate the influence of covariate shift. Simulations of text-independent and text-dependent speaker identification show that the proposed method is promising in the presence of voice quality variations.
Kekre et al presented a Vector Quantization (VQ) method for speaker identification consisting of training and testing phases, with VQ used for feature extraction in both. Two variations were used. In method A, codebooks are generated from speech samples that are converted into 16-dimensional vectors with an overlap of 4. In method B, codebooks are generated from speech samples that are converted into 16-dimensional vectors without overlap. For speaker identification, the codebook of the test sample is generated and compared with the codebooks of the reference samples stored in the database. A comparison of the two schemes shows that method B gives slightly better results than method A.
Zhao et al proposed local spatiotemporal descriptors for representing and recognizing speakers from visual information. Spatiotemporal dynamic texture features, based on local binary patterns extracted from localized mouth regions, describe the motion information in utterances and capture spatial and temporal transition characteristics. Structural edge map features are extracted from image frames to represent appearance characteristics. Combining dynamic texture and structural features captures motion and appearance together, giving the ability to describe the spatiotemporal development of speech. In experiments on the BANCA and XM2VTS databases, the proposed method obtained promising recognition results compared with other features.
An automatic speaker identification system has two stages, feature extraction and classification, as shown in Figure 1, and operates in training and recognition modes. Both modes include a feature extraction step, sometimes referred to as the system's front end. The feature extractor converts the digital speech signal into a sequence of numerical descriptors called feature vectors. In this paper, features are extracted using Symlet wavelets.
Figure 1: Automatic speaker identification system.
For successful classification, every speaker is modelled in training mode from a set of data samples, from which a set of feature vectors is generated and saved in a database. Feature extraction strips unnecessary information from the training speech samples, leaving only the speaker-characteristic information from which speaker models are constructed. When a data sample from an unknown speaker arrives, pattern matching techniques map the features of the input speech sample to a model that corresponds to a known speaker.
The audio data used here was collected for speaker identification in developing-country contexts. It includes 83 unique voices, 35 female and 48 male, and provides audio for limited-vocabulary speaker identification through digit utterances. The data was collected in partnership with Microsoft Research, India.
The data was collected over the telephone using an IVR (interactive voice response) system in India in March 2011. The participants are Indian nationals from differing backgrounds. Each was given a few lines of digits and asked to read the numbers after being prompted by the system. Each participant read five lines of digits, one digit at a time.
The numbers were read in English. Background noise levels vary, from faint hisses to audible conversations or songs. In total, about 30% of the audio has some background noise.
Feature extraction supports speaker identification based on low-level properties of the signal. It must produce enough information for good speaker discrimination, in a form and size that allow efficient modelling. Feature extraction can therefore be defined as the process of reducing the amount of data in a given speech sample while retaining speaker-discriminative information.
The Fourier transform (FT) has a fixed time-frequency resolution and a well-defined inverse transform. Fast algorithms exist for the forward and inverse transforms, making them simple and computationally efficient when applied to speech processing. Wavelets, by contrast, are waveforms localized in both time and frequency. Wavelet analysis decomposes a signal into shifted and scaled versions of a mother wavelet. The Continuous Wavelet Transform (CWT) is defined as the sum over all time of the signal multiplied by scaled and shifted versions of the wavelet function ψ. Mathematically, for a signal f(t), with ψ_{a,b}(t) denoting the wavelet ψ at scale a and position b, the continuous wavelet transform is defined by

$$C(a, b) = \int_{-\infty}^{\infty} f(t)\, \psi_{a,b}(t)\, dt$$
The CWT yields many wavelet coefficients C, which are a function of scale and position. Multiplying each coefficient by the appropriately scaled and shifted wavelet gives the constituent wavelets of the original signal. Daubechies proposed the symlets, nearly symmetrical wavelets obtained by modifying the db family. The properties of the two wavelet families are similar; the main difference is that db wavelets have extremal phase, whereas symlets are the least asymmetric variant, with phase closer to linear. Symlets are compactly supported wavelets with the least asymmetry and the highest number of vanishing moments for a given support width, and the number of filter coefficients can be any positive even number.
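In practice, wavelet features are usually computed with the discrete wavelet transform rather than the CWT. The paper does not state the symlet order, the number of decomposition levels, or how coefficients are turned into features, so the following is only a sketch under assumed choices: an order-2 filter (for which the symlet coefficients coincide with the Daubechies db2 coefficients), periodic signal extension, and sub-band energies as features.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Order-2 scaling (low-pass) filter; sym2 and db2 share these coefficients.
LOW = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
# Matching wavelet (high-pass) filter from the quadrature-mirror relation.
HIGH = [(-1) ** n * LOW[len(LOW) - 1 - n] for n in range(len(LOW))]

def dwt_step(signal):
    """One DWT level with periodic extension; returns (approximation, detail)."""
    n = len(signal)
    if n % 2 != 0:
        raise ValueError("signal length must be even")
    approx, detail = [], []
    for k in range(n // 2):
        window = [signal[(2 * k + i) % n] for i in range(len(LOW))]
        approx.append(sum(h * x for h, x in zip(LOW, window)))
        detail.append(sum(g * x for g, x in zip(HIGH, window)))
    return approx, detail

def subband_energies(signal, levels):
    """Energy of each detail band (finest first), then of the final approximation."""
    energies = []
    current = list(signal)
    for _ in range(levels):
        current, detail = dwt_step(current)
        energies.append(sum(d * d for d in detail))
    energies.append(sum(a * a for a in current))
    return energies

# Hypothetical 16-sample frame: a slow sine plus a faster one.
frame = [math.sin(2 * math.pi * t / 16) + 0.5 * math.sin(2 * math.pi * 5 * t / 16)
         for t in range(16)]
features = subband_energies(frame, levels=3)
```

Because the filter pair is orthonormal, the sub-band energies sum to the energy of the input frame, so the feature vector describes how that energy is distributed across scales.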
Principal Component Analysis
PCA is an established feature extraction technique for dimensionality reduction, based on the assumption that most class information lies in the directions along which variation is largest. These directions are the principal components. PCA is commonly derived as a standardized linear projection that maximizes the variance in the projected space. It is useful for data compression, reducing the number of dimensions with little loss of information. Here, PCA is used to reduce the dimension of the extracted feature vector.
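The variance-maximizing projection can be sketched in a few lines. This is an illustrative version only: it finds just the first principal component by power iteration on the covariance matrix, and the four two-dimensional data points are made up for the example.

```python
import math

def mean_center(rows):
    """Subtract the per-column mean from every row."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
            for i in range(dims)]

def first_component(cov, iterations=200):
    """Leading eigenvector of the covariance matrix, found by power iteration."""
    dims = len(cov)
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centered, component):
    """Coordinate of each row along the given unit-length component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]

# Hypothetical 2-D feature vectors lying close to the line y = 2x.
data = [[2.0, 4.1], [4.0, 7.9], [6.0, 12.1], [8.0, 15.9]]
centered = mean_center(data)
pc1 = first_component(covariance(centered))
scores = project(centered, pc1)
```

Here the two-dimensional points are reduced to one score each, and almost all of the variance is retained because the points lie close to a single direction.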
Classification in automatic speaker identification systems is a feature matching process between new speaker features and those saved in the database.
Given a set of objects, each of a known class and with a known vector of variables, the aim of supervised classification is to construct a rule that assigns future objects to a class when only their vectors of variables are given. Such problems are ubiquitous, and many methods for constructing such rules have been developed. The Naïve Bayes classifier is commonly used: it is easy to build and needs no complicated iterative parameter estimation schemes, so it can be applied to large data sets. It is also easy to interpret, so users unfamiliar with classifier technology can understand why it makes the classifications it does. The Naïve Bayes model also appeals because of its simplicity, elegance and robustness. Although it is one of the oldest classification algorithms, it remains effective in its simple form, and the statistical, data mining, machine learning and pattern recognition communities have introduced modifications to make it more flexible.
The Naïve Bayes classifier estimates the conditional probability of each attribute given the class from the training data; for continuous attributes, the estimate is based on the mean and variance of the training data for each class. An input, represented by its feature vector, is assigned to the most likely class. The classifier assumes that the features are independent given the class, which simplifies learning. When inputs are represented by a feature vector X and classes by C, Naïve Bayes predicts the class as follows:

$$P(X \mid C) = \prod_{i=1}^{n} P(X_i \mid C), \qquad \hat{C} = \arg\max_{C}\; P(C)\, P(X \mid C)$$

where X = (X1, …, Xn) is the feature vector and C is a class.
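A minimal sketch of this classifier for continuous features is given below. It assumes a Gaussian likelihood for each feature, which matches the mean-and-variance description but is not confirmed by the paper, and the speaker labels and feature values are hypothetical.

```python
import math
from collections import defaultdict

def fit(samples, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(samples), means, variances)
    return model

def log_posterior(x, prior, means, variances):
    """log P(C) + sum_i log P(x_i | C), up to a constant shared by all classes."""
    total = math.log(prior)
    for value, m, var in zip(x, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (value - m) ** 2 / (2 * var)
    return total

def predict(model, x):
    """Return the class with the highest log posterior."""
    return max(model, key=lambda label: log_posterior(x, *model[label]))

# Hypothetical reduced feature vectors for two speakers.
samples = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
labels = ["s1", "s1", "s1", "s2", "s2", "s2"]
model = fit(samples, labels)
guess = predict(model, [1.1, 2.1])
```

Working with log probabilities replaces the product over features with a sum, which avoids numerical underflow when the feature vector is long.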
Decision tree structures organize classification schemes. In classification tasks, a decision tree visualizes the steps taken to arrive at a classification. Every decision tree begins with a root node, considered the "parent" of all other nodes. Each node in the tree evaluates an attribute of the data and determines which path to follow. Typically, the decision test compares a value against some constant. An instance is classified by routing it from the root node down the tree until it reaches a leaf node.
J48 is an implementation of the popular C4.5 algorithm developed by J. Ross Quinlan. Decision trees are a classic way to represent the information learned by a machine learning algorithm and offer a fast way to express structures in data. The J48 algorithm has many options related to tree pruning. Many algorithms attempt to "prune" their results. Pruning produces fewer, more easily interpreted results and can also be used to correct overfitting. Without pruning, the basic algorithm classifies recursively until every leaf is pure, meaning that the data has been categorized as close to perfectly as possible, which ensures maximum accuracy on the training data. It may, however, create excessive rules that describe only the idiosyncrasies of that particular data, and such rules may not be effective on new data. Pruning reduces the accuracy of the model on the training data, because it employs various means to relax the specificity of the decision tree, in the hope of improving its performance on test data. The overall concept is to generalize a decision tree gradually until it attains a balance of flexibility and accuracy.
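The test placed at each node is chosen by an information-theoretic criterion; C4.5 uses the gain ratio. The sketch below shows that criterion for a single numeric attribute in simplified form. It is not the J48 implementation: pruning and the further refinements of C4.5 are omitted, the midpoint thresholds are an assumption, and the attribute values and labels are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary test `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    weights = [len(left) / n, len(right) / n]
    gain = entropy(labels) - weights[0] * entropy(left) - weights[1] * entropy(right)
    split_info = -sum(w * math.log2(w) for w in weights)
    return gain / split_info

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values and keep the best one."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))

# Hypothetical values of one attribute for six samples from two speakers.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = ["s1", "s1", "s1", "s2", "s2", "s2"]
threshold = best_threshold(values, labels)
```

A tree is grown by applying this selection recursively to the subsets on each side of the chosen threshold until the leaves are pure, after which pruning relaxes the tree as described above.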
The speech samples from the dataset were used for speaker identification. Fifty samples were used for evaluating the classifiers. An example of the speech input file given to a participant is as follows:
The features from the samples were extracted using Symlet wavelets. The resulting features were reduced using PCA for efficient classification. The input sample and the output are shown in Figures 2 and 3 respectively.
Figure 2: Input Speech
Figure 3: Output in Symlet Wavelet
The samples were classified using Naïve Bayes and J48. The summary of results is tabulated in Table 1.
Table 1: Summary of the results
Correctly Classified Instances
Root mean squared error
Table 1 shows that J48 and Naïve Bayes achieve the same classification accuracy of 82%. However, the root mean squared error for J48 is slightly lower than that for Naïve Bayes. Table 2 gives the precision, recall and F-measure by class for both classifiers. Figures 4 and 5 show the precision and recall, and the F-measure, respectively.
Table 2: Precision, Recall and F-Measure by Class
Figure 4: Precision and Recall
Figure 5: F-Measure
The graphs and table above show that, although precision and recall vary slightly across the classes, the weighted averages of precision and recall are nearly the same for both classifiers. Both classifiers, J48 and Naïve Bayes, perform equally well in classifying the speech samples. Further investigation is required to refine the feature extraction process and to examine the performance of soft computing methods for classification.
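For reference, per-class precision, recall and F-measure of the kind reported in Table 2 can be computed from the actual and predicted labels as in the sketch below. The label lists are hypothetical and are not the experimental data.

```python
def per_class_metrics(actual, predicted):
    """Return {class: (precision, recall, f_measure)} from paired label lists."""
    metrics = {}
    for cls in sorted(set(actual) | set(predicted)):
        tp = sum(1 for a, p in zip(actual, predicted) if a == cls and p == cls)
        fp = sum(1 for a, p in zip(actual, predicted) if a != cls and p == cls)
        fn = sum(1 for a, p in zip(actual, predicted) if a == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        metrics[cls] = (precision, recall, f_measure)
    return metrics

def accuracy(actual, predicted):
    """Fraction of correctly classified instances."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical labels for five test utterances from two speakers.
actual = ["s1", "s1", "s2", "s2", "s2"]
predicted = ["s1", "s2", "s2", "s2", "s1"]
metrics = per_class_metrics(actual, predicted)
overall = accuracy(actual, predicted)
```

The F-measure is the harmonic mean of precision and recall, so it is high for a class only when both are high.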
Speech recognition research faces multiple problems, such as unconstrained input speech, uncooperative speakers and uncontrolled environmental parameters, which make it necessary to focus on the features that are unique to an individual's speech. Various approaches to speaker identification are available in the literature. In this paper, a speaker recognition approach based on wavelet feature extraction is investigated. The method uses Symlet wavelets for feature extraction. The extracted features are then classified using two data mining algorithms, J48 and Naïve Bayes. Experimental results showed that both classifiers achieved a classification accuracy of 82%. Further investigations are required to improve classifier efficiency.