Comparative Analysis of Machine Learning Algorithms in Speaker Recognition of Whispered Speech


23/09/19




Abstract

Speaker recognition is the process of automatically identifying who is speaking by using speaker-specific information contained in the speech signal. This information captures the unique characteristics of a speaker’s voice. The objective of speaker recognition is to extract, characterize and recognize the information about the speaker’s identity, and it has emerged as one of the important application areas of speech and language processing. This work compares several machine learning methods, namely Support Vector Machines (SVM), K-Nearest Neighbors, Gradient Boosting, EXTRA Trees and Random Forests, for recognizing the speaker of whispered speech. Whispered recordings from the CHAINS Whispered Speech corpus are used in the experiments. Feature vectors are extracted from the corpus using Mel-Frequency Cepstral Coefficients (MFCC), which convey the speaker’s voice characteristics. These coefficients are then used by the algorithms to recognize the unknown speaker from a given set of speakers.


1. Introduction

Speaker recognition involves identifying the person who is speaking. The technology can be used in many areas of daily life for identity verification through biometrics, that is, authentication based on unique features of a person. There are numerous biometric modalities, such as irises, fingerprints, speech and face [1]. Applications include voice dialing, banking transactions, database services, voice mail, security control for classified data, remote access to systems and forensics. Because these traits are unique to each individual, they are difficult to replicate, copy or steal. Among the modalities, speech is advantageous because it is easy to obtain and convenient to use. Biometrics therefore offer a more secure and convenient means of identity authentication.

Rather than regular speech, this paper focuses on whispered speech. Whispered speech offers the privacy required in certain security-sensitive tasks. It differs from normal speech in its low sound pressure level and its lack of pitch frequency, and it has a different short-term amplitude distribution. Since whispered speech has a low signal-to-noise ratio (SNR), it is difficult to replicate, which helps prevent intrusion into a security system.

The speaker is initially unknown and must be determined by comparison with the trained templates. A very large number of templates is often involved in identifying a speaker, so precise identification is a difficult task and speaker recognition remains an active research area. Speaker recognition works on the premise that a person’s speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the possible variations in the input speech signal [2].


2. Related Work

A number of studies on speaker recognition have been presented in the literature. Several recent approaches using support vector machines have been proposed for speech applications. The first set of approaches attempts to model emission probabilities for hidden Markov models [3], [4]. Even though this approach reduces error to an extent, it still has problems. First, the need for large training sets results in long training times for support vector methods. Second, the emission probabilities must be approximated [5], since the output of a support vector machine is not a probability. Another line of work on speaker identification and verification uses a subspace-based enhancement technique and probabilistic support vector machines (SVMs) [6], [7]. A perceptual filter bank is created, and the prior signal-to-noise ratio (SNR) of each sub-band within the filter bank is used to decide the estimator’s gain so as to suppress background noise effectively. A probabilistic SVM is then used to identify or verify the speaker. The major drawback of these models is that they search for a suitable hyperplane in the input space, which is too restrictive to be of practical use.

The Gaussian Mixture Model (GMM) [8], [9] has become a dominant approach for robust text-independent speaker identification. In [8], the individual Gaussian components of a GMM are shown to represent general speaker-dependent spectral shapes that are effective for modeling speaker identity. The focus is on applications that require high identification rates from short utterances of unconstrained conversational speech, and robustness to the degradations produced by transmission over a telephone channel. However, an increase in telephone speech utterances causes a decrease in accuracy. The GMM is further optimized by stacking the means of the GMM components to form a GMM mean supervector [10], which is then used in a support vector machine (SVM) classifier.

In [11], the Viterbi algorithm is used for hidden Markov model (HMM) alignment in order to support speaker verification (SV) on portable devices and on telephone servers with millions of users. The selection of a beam width is crucial; however, it is difficult to determine a suitable beam width beforehand. A small beam width may miss the optimal path, while a large one may slow down the alignment. A non-heuristic approach is therefore proposed to reduce the search space. Bayesian approaches have also been proposed for speaker verification [12], where the verification decision is made by evaluating Bayes factors against a critical threshold. The aim is an efficient algorithm to calculate the Bayes factors for the GMM, but this calculation in turn requires the computation of several Bayesian predictive densities. In [13], elliptical basis function parameters are estimated by the EM algorithm and applied to speaker verification.

Speech signals consist of phonemes, which are perceptually distinct units of sound in a given language that distinguish one word from another. In whispered speech, voicing is greatly reduced or eliminated, so the pronunciation of phonemes does not follow that of normal speech, from which the distinctive features were identified. Phonemes can become indistinguishable in whispered speech, leading to erroneous word matches. Related work on speaker recognition from whispered speech has addressed the problem with various approaches [14], [15], [16], [17].

3. Feature Extraction

The most commonly used feature extraction technique in speaker recognition is Mel-Frequency Cepstral Coefficients (MFCC). This technique was first mentioned by Bridle and Brown in 1974 and further developed by Mermelstein in 1976.

MFCC emulates the logarithmic perception of loudness and pitch of the human auditory system and tries to eliminate speaker-dependent characteristics by excluding the fundamental frequency and its harmonics. To represent the dynamic nature of speech, MFCC also includes the change of the feature vector over time as part of the feature vector. The procedure for obtaining the MFCC feature vector is described below.

3.1 Pre-emphasis

The first step is to apply a pre-emphasis filter to the signal to amplify the high frequencies and lessen the effect of noise. Because high frequencies usually have smaller magnitudes than lower frequencies, the filter balances the frequency spectrum, which also helps avoid numerical problems during the Fourier transform.
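As a minimal sketch of this step, a first-order pre-emphasis filter y[n] = x[n] - alpha * x[n-1] is commonly used. The text does not state the filter or its coefficient, so both the filter form and the value alpha = 0.97 are assumptions made for illustration.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is passed through.

    alpha = 0.97 is an assumed, commonly used value, not one given in the text.
    """
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


# A constant (low-frequency) signal is almost cancelled, while a rapidly
# alternating (high-frequency) signal is amplified.
low = pre_emphasis([1.0, 1.0, 1.0, 1.0])
high = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

Comparing `low` and `high` shows the intended effect: after the first sample, the slowly varying input is attenuated and the fast varying one is boosted.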

3.2 Framing

After pre-emphasis, the signal is split into short time frames. The rationale behind this step is that the frequencies in a signal change over time, so if the Fourier transform were applied over the whole signal, the frequency contours of the signal over time would be lost. To avoid that, we can safely assume that the frequencies in a signal are stationary over a very short period of time. By taking a Fourier transform over each short frame, we can obtain a good approximation of the frequency contours of the signal by concatenating adjacent frames. We use a frame size of 50 milliseconds and a frame step of 25 milliseconds (50% overlap).
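The framing step can be sketched as follows, using the 50 ms frame size and 25 ms step given above. The 16 kHz sampling rate in the usage line is an assumption, and trailing samples that do not fill a complete frame are simply dropped in this sketch.

```python
def frame_signal(signal, sample_rate, frame_ms=50, step_ms=25):
    """Split a signal into overlapping frames of frame_ms, advancing by step_ms."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    step = int(round(sample_rate * step_ms / 1000))
    frames = []
    # Stop once a full frame no longer fits; the remainder is discarded.
    for start in range(0, len(signal) - frame_len + 1, step):
        frames.append(signal[start:start + frame_len])
    return frames


# One second of dummy samples at an assumed 16 kHz sampling rate.
frames = frame_signal(list(range(16000)), sample_rate=16000)
```

With these settings each frame holds 800 samples and consecutive frames share 400 of them, which is the 50% overlap described in the text.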

3.3 Windowing

Windowing each individual frame minimizes spectral distortion by tapering the signal discontinuities at the beginning and end of the frame. A Hamming window is applied since it provides a good balance between frequency resolution (the separability of closely spaced peaks in the frequency domain) and dynamic range.
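For illustration, the standard Hamming window w[n] = 0.54 - 0.46 cos(2πn / (N - 1)) can be generated and applied to a frame as below. The helper names are hypothetical, and only the Python standard library is used to keep the sketch self-contained.

```python
import math


def hamming(n_samples):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_samples == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
        for n in range(n_samples)
    ]


def apply_window(frame, window):
    """Multiply a frame by the window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]


window = hamming(5)
windowed = apply_window([1.0] * 5, window)
```

The window is close to zero at both ends of the frame and equal to one in the middle, which is what tapers the discontinuities at the frame boundaries.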

3.4 Discrete Fourier Transforms (DFT)

The Discrete Fourier Transform of frame i is computed as:

$$S_i(k) = \sum_{n=1}^{N} s_i(n)\, h(n)\, e^{-j 2\pi k n / N}, \qquad 1 \le k \le K$$

where s_i(n) is the n-th sample of frame i, h(n) is an N-sample Hamming window, and K is the length of the Discrete Fourier Transform. The periodogram-based estimate of the power spectrum for the speech frame is given by:

$$P_i(k) = \frac{1}{N} \left| S_i(k) \right|^2$$

That is, the absolute value of the complex Fourier transform is taken and the result is squared. A 512-point DFT is performed and only the first 257 coefficients are kept.
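The periodogram estimate described above can be sketched with a direct DFT. This is a minimal illustration using only the standard library; in practice a library FFT routine would normally be used. The normalisation by the number of frame samples and the zero padding of shorter frames are assumptions.

```python
import cmath
import math


def power_spectrum(frame, n_fft=512):
    """Periodogram estimate |S(k)|^2 / N for k = 0 .. n_fft // 2.

    N is the number of samples in the frame (an assumed normalisation);
    frames shorter than n_fft are zero padded.
    """
    n = len(frame)
    if n == 0 or n > n_fft:
        raise ValueError("frame must contain between 1 and n_fft samples")
    padded = list(frame) + [0.0] * (n_fft - n)
    spectrum = []
    for k in range(n_fft // 2 + 1):  # 257 coefficients for a 512-point DFT
        s_k = sum(
            x * cmath.exp(-2j * math.pi * k * i / n_fft)
            for i, x in enumerate(padded)
        )
        spectrum.append(abs(s_k) ** 2 / n)
    return spectrum


# A pure cosine that completes 8 cycles in 512 samples peaks at bin 8.
tone = [math.cos(2 * math.pi * 8 * i / 512) for i in range(512)]
spectrum = power_spectrum(tone)
```

Only the first 257 coefficients are returned because the spectrum of a real-valued signal is symmetric, so the remaining bins carry no additional information.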

3.5 Mel Scaled Filter Banks

The filter banks are computed by applying triangular filters on a Mel scale to the power spectrum to extract frequency bands. The Mel scale aims to mimic the non-linear perception of sound by the human ear, which is more sensitive to small changes in pitch at low frequencies than at high frequencies. Incorporating this scale makes the features match more closely what humans hear.

We can convert between Hertz (f) and Mel (m) using the following equations.

The formula for converting from frequency to Mel scale is:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

To go from Mel back to frequency:

$$f = 700 \left(10^{m/2595} - 1\right)$$

Each filter in the filter bank is triangular, with a response of 1 at its center frequency that decreases linearly towards 0 until it reaches the center frequencies of the two adjacent filters, where the response is 0, as shown in the figure.
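A possible construction of such a triangular filter bank is sketched below. It assumes the common conversion m = 2595 log10(1 + f / 700) between Hertz and Mel; the number of filters (26) and the 16 kHz sampling rate are also assumptions, since the text does not state them.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    """Return n_filters triangular filters, each with n_fft // 2 + 1 weights."""
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 points equally spaced in Mel: filter edges and centres.
    mel_points = [
        low_mel + i * (high_mel - low_mel) / (n_filters + 1)
        for i in range(n_filters + 2)
    ]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        weights = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):        # rising edge, 0 at the left centre
            weights[k] = (k - left) / (centre - left)
        weights[centre] = 1.0                # response of 1 at the centre
        for k in range(centre + 1, right):   # falling edge, 0 at the right centre
            weights[k] = (right - k) / (right - centre)
        bank.append(weights)
    return bank


bank = mel_filterbank()
```

Each filter's energy is then the weighted sum of the power spectrum under that triangle, so the narrow low-frequency filters and wide high-frequency filters follow the spacing of the Mel scale.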

A typical MFCC feature vector is calculated from a window of 512 sample points and consists of 13 cepstral coefficients together with 13 first-order and 13 second-order derivatives. Only the first 13 MFCCs (the lower dimensions) are kept, since they represent the spectral envelope and keeping only these reduces the dimensionality of the feature space.
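The final step can be sketched as taking a discrete cosine transform of the log filter bank energies, keeping the first 13 coefficients, and appending first-order and second-order differences to obtain a 39-dimensional vector. The unnormalised DCT-II and the simple central-difference delta used here are assumptions; other definitions are in use.

```python
import math


def dct_ii(values):
    """Unnormalised type-II discrete cosine transform."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def mfcc_from_energies(filterbank_energies, n_coeffs=13):
    """Log-compress the filter bank energies and keep the first n_coeffs."""
    log_energies = [math.log(e + 1e-10) for e in filterbank_energies]
    return dct_ii(log_energies)[:n_coeffs]


def delta(frames):
    """Central difference between neighbouring frames (edge frames repeated)."""
    padded = [frames[0]] + frames + [frames[-1]]
    return [
        [(nxt - prv) / 2.0 for prv, nxt in zip(padded[t - 1], padded[t + 1])]
        for t in range(1, len(padded) - 1)
    ]


def feature_vectors(cepstra):
    """Concatenate coefficients with first- and second-order differences."""
    d1 = delta(cepstra)
    d2 = delta(d1)
    return [c + a + b for c, a, b in zip(cepstra, d1, d2)]


coeffs = mfcc_from_energies([math.e] * 26)
vectors = feature_vectors([[float(t)] * 13 for t in range(4)])
```

For a flat set of filter bank energies, as in the first usage line, only the zeroth coefficient is non-zero, which illustrates that the higher coefficients describe the shape of the spectral envelope rather than its overall level.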

4. Machine Learning Techniques

4.1 Support Vector Machines (SVM)

SVM is a supervised machine learning classification technique that is widely applied in the field of speaker recognition. SVM works by selecting samples from the classes, known as support vectors, and separating the classes with a linear function that divides them as widely as possible using these support vectors. It is primarily a classifier that performs classification by constructing hyperplanes in a multidimensional space that separate instances of different class labels. In effect, an SVM maps an input vector to a high-dimensional space and seeks the most appropriate hyperplane that divides the data set into classes.

[Fig] This linear classifier aims to maximize the distance between the decision hyperplane and the nearest data point, called the margin, by finding the best-suited hyperplane. SVM depends on the support vectors, which are the data points closest to the decision boundary. Removing data points that lie further from the decision hyperplane does not change the boundary, whereas removing support vectors does.


4.2 Random Forests

Random Forests [18] is an extension of bagging (bootstrap aggregation), used when the aim is to reduce the variance of a set of decision trees. Subsets of the data are created from the training sample by random sampling with replacement, and each subset is used to train its own decision tree. In addition, a random selection of features, rather than all features, is considered when growing the trees. As a result, we end up with an ensemble of different models. The predictions of the individual trees are averaged, which is more robust than a single decision tree. Decorrelating the trees in this way maximizes the decrease in variance.
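The three ingredients described above (bootstrap sampling, random feature selection, and combining the trees' predictions) can be sketched with the helpers below. The tree learner itself is omitted, the helper names are hypothetical, and majority voting is assumed as the way predictions are combined for classification.

```python
import random
from collections import Counter


def bootstrap_sample(samples, rng):
    """Draw len(samples) items from the training set with replacement."""
    return [rng.choice(samples) for _ in samples]


def random_feature_subset(n_features, subset_size, rng):
    """Choose the feature indices a tree is allowed to consider."""
    return sorted(rng.sample(range(n_features), subset_size))


def majority_vote(predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(1)
sample = bootstrap_sample(list(range(10)), rng)
# 39 features and a subset of 6 are illustrative values only.
features = random_feature_subset(39, 6, rng)
winner = majority_vote(["speaker_a", "speaker_b", "speaker_a"])
```

Because each tree sees a different bootstrap sample and a different feature subset, the trees make different errors, which is what makes combining them worthwhile.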

4.3 K Nearest Neighbors

K Nearest Neighbors is a non-parametric classification algorithm. It assigns to an unlabeled sample point the class of the nearest of a set of previously labeled points. The rule is independent of the joint distribution of the sample points and their classifications. It is suitable for multi-modal classes and for applications in which an object can have many labels. Performance depends on choosing a good value of ‘k’. There is no principled way to choose ‘k’ other than computationally expensive techniques such as cross validation. The method is affected by noise and is sensitive to irrelevant features.
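A minimal sketch of the nearest-neighbor rule is given below. Euclidean distance and majority voting among the k nearest points are assumptions, and the two-dimensional toy points stand in for MFCC feature vectors.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups of labeled points.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["A", "A", "A", "B", "B", "B"]
near_a = knn_predict(train_x, train_y, (0.5, 0.5), k=3)
near_b = knn_predict(train_x, train_y, (5.5, 5.5), k=3)
```

No model is fitted in advance: all the work happens at query time, which is why the choice of k and the quality of the features matter so much.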

4.4 EXTRA Trees

The Extra-Trees algorithm builds an ensemble of unpruned decision or regression trees according to the classical top-down procedure. Its two main differences from other tree-based ensemble methods are that it splits nodes by choosing cut-points fully at random and that it uses the whole learning sample (rather than a bootstrap replica) to grow the trees. The top-down splitting in the tree is thus randomized: rather than computing the locally optimal cut-point for each feature based on information gain or Gini impurity, a random value is selected for the split for each feature under consideration. This value is drawn from the feature’s empirical range in the tree’s training set.
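The randomized split can be illustrated for a single feature as follows. This is only a sketch of the cut-point selection, not a full tree builder, and the function name is hypothetical.

```python
import random


def random_split(samples, feature_index, rng):
    """Split samples at a cut-point drawn uniformly from the feature's range."""
    values = [s[feature_index] for s in samples]
    cut = rng.uniform(min(values), max(values))
    left = [s for s in samples if s[feature_index] < cut]
    right = [s for s in samples if s[feature_index] >= cut]
    return cut, left, right


rng = random.Random(0)
samples = [(1.0,), (2.0,), (3.0,), (4.0,)]
cut, left, right = random_split(samples, 0, rng)
```

In a full implementation, one such random cut would be drawn for each candidate feature at a node, and the best of these candidate splits would be kept.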

4.5 Gradient Boosting

Boosting is used to create a collection of predictors. In this technique, learners are trained sequentially, with early learners fitting simple models to the data and the data then being analyzed for errors. Consecutive trees are fit (each on a random sample), and at every step the goal is to improve on the accuracy of the previous trees. When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. This process converts weak learners into a better-performing model. Combining gradient descent with boosting yields gradient boosting: at each iteration, a regression tree is fitted to predict the negative gradient of the loss, which allows an arbitrary differentiable loss function to be optimized.
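To make the idea of fitting each new tree to the negative gradient concrete, the sketch below boosts one-split regression trees (stumps) on one-dimensional data under squared loss, for which the negative gradient is simply the residual. It is a simplified regression illustration under these assumptions, not the classifier used in the experiments, and it requires at least two distinct input values.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: one threshold, one mean per side."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def gradient_boost(xs, ys, n_stages=50, learning_rate=0.1):
    """Fit n_stages stumps sequentially, each to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_stages):
        # For squared loss, the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
weak = gradient_boost(xs, ys, n_stages=1)
strong = gradient_boost(xs, ys, n_stages=100)
```

The learning rate shrinks each stump's contribution, so more boosting stages are needed to fit the data, which is the trade-off between learning rate and number of trees.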

5. Proposed Speaker recognition system

A speaker recognition system consists of two main modules: feature extraction and feature matching. Feature extraction is the process of extracting from the voice signal a small amount of data that represents each speaker; the resulting feature vector captures specific characteristics of the speaker. Various feature extraction techniques exist, such as Linear Prediction Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive Coefficients (PLP). An analysis of feature extraction methods [19] led to the selection of MFCC as the technique to be used, as it facilitates better feature extraction by reducing noise [20]. Feature matching is the process of identifying the unknown speaker by comparing the features extracted from his or her voice input with those from a set of known speakers. Feature matching is implemented through the machine learning algorithms.

6. Performance measures and Experimental results

6.1 Performance measures:

Accuracy, Precision, Recall and F-Measure are used as performance metrics. Accuracy is the proportion of speakers that are correctly recognized. Precision is the proportion of positive identifications that are actually correct. Recall is the proportion of actual positives that are identified correctly. F-Measure is the harmonic mean of precision and recall; it can be viewed as a summary measure that combines both and indicates whether an increase in precision (recall) outweighs a reduction in recall (precision). The table is the confusion matrix.

The precision, recall and F-Measure are computed using the formula [Fig].

The figure describes the metrics computed.
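As an illustration of these definitions, the metrics for one speaker treated as the positive class can be computed from true and predicted labels as follows. The speaker labels are invented for the example, and returning 0 when a denominator is zero is an assumption.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F-measure, treating one speaker as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


def accuracy(y_true, y_pred):
    """Proportion of samples whose predicted speaker is correct."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


y_true = ["s1", "s1", "s1", "s2", "s2", "s3"]
y_pred = ["s1", "s1", "s2", "s2", "s2", "s1"]
precision, recall, f_measure = per_class_metrics(y_true, y_pred, "s1")
overall = accuracy(y_true, y_pred)
```

For a multi-speaker task, these per-speaker values would typically be averaged over all speakers to give a single figure per classifier.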

6.2 Experimental results and Discussion:

6.2.1 Dataset

The CHAracterizing INdividual Speakers (CHAINS) [21] Whispered Speech Corpus, which consists of 36 speakers, is used for speaker recognition. Of these, 28 speakers (14 male, 14 female) are from the eastern part of Ireland and speak Eastern Hiberno-English. The remaining 8 speakers (4 male, 4 female) are from the UK and the USA. A whispered speech signal is taken as input and the feature vectors that characterize it are extracted. These feature vectors are distinctive for each speaker and are used to identify the speaker during the testing phase by matching the features of an unknown utterance against those of the known speakers.

6.2.2 Experiment Setup

Stratified 10-fold cross validation is used to evaluate the learning algorithms. The dataset is randomly divided into 10 folds; 9 folds are used to train the classifier and the remaining fold is used to evaluate its performance. The class distribution in the training and test sets is kept the same as in the original dataset to simulate actual usage of the algorithm. Feature extraction and feature matching are carried out with the pyAudioAnalysis [22] library.
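The stratified split can be sketched as below. This is a standard-library illustration of the idea, not the routine used by pyAudioAnalysis; the function names and the round-robin assignment of each class's samples to folds are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Return n_folds lists of sample indices with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def train_test_splits(labels, n_folds=10, seed=0):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = stratified_folds(labels, n_folds, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


labels = ["a"] * 20 + ["b"] * 30
splits = list(train_test_splits(labels))
```

With 20 samples of one class and 30 of another, every test fold contains 2 and 3 samples of the respective classes, so each fold mirrors the class distribution of the whole dataset.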

The learning algorithms evaluated are Gradient Boosting, EXTRA Trees, Random Forests, Support Vector Machines and K-Nearest Neighbors. Each algorithm is tuned over one designated parameter:

  • The soft margin parameter C for the SVM classifier
  • The number of nearest neighbors, k for the kNN classifier.
  • The number of trees in the random forest classifier.
  • The number of boosting stages in the gradient boosting classifier.
  • The number of trees in the forest of the extra trees classifier.

6.2.1 Support Vector Machine Performance:

In an SVM, there are two objectives when selecting a hyperplane: finding the hyperplane with the largest minimum margin, and finding a hyperplane that correctly separates as many instances as possible. The problem is that both objectives cannot always be achieved completely. The C parameter determines how much weight is given to the latter objective, that is, how much one wants to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane classifies more of the training points correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. For very small values of C, misclassified examples are often obtained even if the training data is linearly separable. The best choice of C depends on what the future data looks like.
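For reference, this trade-off is usually written as the soft-margin objective below. The symbols w (weight vector), b (bias), ξ_i (slack of training example i), x_i, y_i ∈ {−1, +1} and n (number of training examples) do not appear in the text and are introduced here as assumptions following the standard formulation.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\quad i = 1,\dots,n
```

The first term favors a large margin and the second penalizes margin violations, so a larger C pushes the optimizer towards classifying every training point correctly at the cost of a smaller margin.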

For the smallest initial value (0.001), the large number of misclassified samples leads to low accuracy. As C is gradually increased while still being kept low, an increase in accuracy is observed.

In the figure, a low C is used on the left, which gives a large minimum margin; the outliers are neglected. On the right, a high C is used; the outliers are no longer neglected, and the classifier ends up with a much smaller margin. Among the values tested, a C value of 20.0 provides the highest accuracy and F-measure.

6.2.2 Random Forest Performance

In general, ensemble methods reduce the prediction variance, which improves the accuracy of the ensemble. The variance term of the expected generalization error of an individual randomized model is defined as:

$$\operatorname{var}(x) = \sigma^2(x)$$

The variance term of the expected generalization error of an ensemble of M randomized models is given by:

$$\operatorname{var}(x) = \rho(x)\,\sigma^2(x) + \frac{1 - \rho(x)}{M}\,\sigma^2(x)$$

where ρ(x) is the Pearson correlation coefficient between the predictions of two randomized models trained on the same data from two independent seeds. Increasing the number of decision trees in the random forest, that is, a larger M, causes the variance of the ensemble to decrease when ρ(x) is less than 1. The variance of an ensemble with more than one model is therefore strictly smaller than the variance of an individual model, and increasing the number of individual randomized models in an ensemble does not increase the generalization error. Small variance and little bias lead to a highly accurate estimator. Hence, among the values tested, 500 trees gives the highest accuracy.
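The effect of the number of trees M can be illustrated numerically, assuming the standard decomposition for an average of M equally correlated randomized models, ρ σ² + (1 − ρ) σ² / M. The values σ² = 1.0 and ρ = 0.3 below are arbitrary and chosen only for illustration.

```python
def ensemble_variance(sigma2, rho, n_models):
    """Variance of an average of n_models models with pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) / n_models * sigma2


# Arbitrary illustrative values for the single-model variance and correlation.
variances = {
    m: ensemble_variance(sigma2=1.0, rho=0.3, n_models=m)
    for m in (1, 10, 100, 500)
}
for m, v in variances.items():
    print(m, v)
```

The variance falls quickly for the first few trees and then levels off near ρ σ², which is why adding trees never hurts but gives diminishing returns.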

6.2.3 EXTRA Trees Performance

The parameter indicates the number of decision trees in the forest. The rationale for further randomizing tree building is that, for numerical input features, the choice of the optimal cut-point is responsible for a large proportion of the variance of the induced tree. From a statistical point of view, dropping the bootstrapping idea leads to an advantage in terms of bias, whereas the cut-point randomization often has an excellent variance reduction effect. An increase in the number of trees reduces the variance, and lower variance leads to higher accuracy. Thus an increase in the number of trees leads to increased accuracy.

6.2.4 K-Nearest Neighbors Performance

The parameter indicates the number of nearest neighbors returned, which has an impact on accuracy. There is normally a K value that gives good performance for a given application; a suitable value can be found using the receiver operating characteristic (ROC) curve. Normally K = 1 is too noisy, and K = 3 or 5 is used to smooth things out. However, larger K values tend to bias the KNN algorithm towards a dominant class, causing underfitting. As observed, increasing K up to a certain limit increases accuracy, but for larger K values the increase in bias leads to decreased accuracy. The highest accuracy is obtained for K = 5.

6.2.5 Gradient Boosting Performance

The two most important parameters are the number of trees and the learning rate. The parameter varied here is the number of trees, while the learning rate is fixed at 0.1; the aim is to find the optimum number of trees for this learning rate. As observed, accuracy increases with the number of trees, and the highest accuracy among the values tested is obtained with 500 trees.

Conclusion and Future Work

This paper discusses commonly used supervised machine learning algorithms for classification. Our aim was to prepare a comprehensive review of the key ideas, drawing out the pros and cons and useful variants of the algorithms discussed. The paper shows that the performance of each algorithm depends on the area of application and that no single algorithm is superior in every scenario. The choice of an appropriate algorithm depends on the type of problem and the data available. In addition, accuracy can be increased by choosing two or more suitable algorithms and combining them into an ensemble. We hope that the references cited cover the major drawbacks and guide the researcher in interesting research directions.


References

[1] A. K. Jain, A. Ross, and S. Prabhakar, “An introduction to biometric recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 1, pp. 4–20, Jan. 2004.

[2] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.

[3] Wan, V., Campbell, W.M., 2000. Support vector machines for verification and identification, in: Neural Networks for Signal Processing X, Proceedings of the 2000 IEEE Signal Processing Workshop, pp. 775–784.

[4] Ganapathiraju, A., Picone, J., 2000. Hybrid SVM/HMM architectures for speech recognition, in: Speech Transcription Workshop

[5] Platt, J.C., 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, in: Advances in Large Margin Classifiers, MIT Press.

[6] J. C. Wang, C. H. Yang, J. F. Wang, and H. P. Lee, “Robust speaker identification and verification,” IEEE Comput. Intell. Mag., vol. 2, no. 2, pp. 52–59, May 2007.

[7] W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, and P. A. Torres-Carrasquillo, “Support vector machines for speaker and language recognition,” Comput. Speech Language, vol. 20, no. 2–3, pp. 210–229, Apr.-Jul. 2006.

[8] D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72–83, Jan. 1995.

[9] B. Xiang and T. Berger, “Efficient text-independent speaker verification with structural Gaussian mixture models and neural network,” IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 447–456, Sep. 2003.

[10] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, “Support vector machines using GMM supervectors for speaker verification,” IEEE Signal Process. Lett., vol. 13, no. 5, pp. 308–311, May 2006.

[11] Q. Li, “A detection approach to search-space reduction for HMM state alignment in speaker verification,” IEEE Trans. Speech Audio Process., vol. 9, no. 5, pp. 569–578, Jul. 2001.

[12] H. Jiang and L. Deng, “A Bayesian approach to the verification problem: Applications to speaker verification,” IEEE Trans. Speech Audio Process., vol. 9, no. 8, pp. 874–884, Nov. 2001.

[13] M. W. Mak and S. Y. Kung, “Estimation of elliptical basis function parameters by the EM algorithm with application to speaker verification,” IEEE Trans. Neural Netw., vol. 11, no. 4, pp. 961–969, Jul. 2000.

[14] Q. Jin, S. S. Jou, and T. Schultz, “Whispering speaker identification,” in Proc. IEEE Int. Conf. Multimedia Expo, 2007, pp. 1027–1030.

[15] J. Gu and H. M. Zhao, “Whispered speech speaker identification based on SVM and FA,” in Proc. Int. Conf. Audio Language Image Processing, Nov. 23–25, 2010.

[16] J. Xu and H. Zhao, “Speaker identification with whispered speech using unvoiced-consonant phonemes,” in Proc. Int. Conf. Image Anal. Signal Process, Nov. 9–11, 2012.

[17] J. C. Wang, Y. H. Chin, W. C. Hsieh, C. H. Lin, Y. R. Chen, and E. Siahaan, “Speaker identification with whispered speech for the access control system,” IEEE Trans. Autom. Sci. Eng., vol. 12, no. 4, Oct. 2015.

[18] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32, 2001.

[19] S. Farah and A. Shamim, “Speaker recognition system using mel-frequency cepstrum coefficients, linear prediction coding and vector quantization,” in Proc. 3rd IEEE Int. Conf. Computer, Control and Communication (IC4), 2013.

[20] S. S. Khalid, S. Tanweer, A. Mobin, and A. Alam, “A comparative performance analysis of LPC and MFCC for noise estimation in speech recognition task,” Int. J. Electronics Engineering Research, vol. 9, no. 3, 2003.

[21] F. Cummins et al., CHAracterizing INdividual Speakers (CHAINS) LDC2008S09. Web Download. Philadelphia: Linguistic Data Consortium, 2008.

[22] T. Giannakopoulos, “pyAudioAnalysis: An open-source Python library for audio signal analysis,” PLoS ONE, vol. 10, no. 12, e0144610, 2015.
