This paper reports the results of a comparative study on blind speech separation of three types of convolutive mixtures. The separation criterion is based on Oriented Principal Component Analysis (OPCA) in the frequency domain. OPCA is a (second-order) extension of standard Principal Component Analysis (PCA) aiming at maximizing the power ratio of a pair of signals. This method is compared to two other well-known methods: the Degenerate Unmixing Estimation Technique (DUET) and Convolutive Fast Independent Component Analysis (C-FICA). The methods are objectively compared in terms of the signal-to-interference ratio (SIR) and the Perceptual Evaluation of Speech Quality (PESQ) criteria. The results of experiments carried out using the TIMIT, Noizeus and AURORA speech databases show that OPCA outperforms the other techniques.
Keywords: Blind source separation, speech signals, second-order statistics, Oriented Principal Component Analysis, convolutive mixtures.
In the area of ever-improving communications technologies, we have become used to conversing with others across the globe. Invariably, a real-time telephone conversation begins with a microphone or other audio recording device. Noise in the environment can corrupt our speech signal as it is being recorded, making it harder to both use and understand further down the communications pathway. Other talkers in the environment add their own auditory interference to the conversation. Recent work in advanced signal processing has resulted in new and promising technologies for recovering speech signals that have been corrupted by speech-like and other types of interference. Termed blind source separation methods, these techniques rely on the diversity provided by the collection of multichannel data by an array of distant microphones. The practical goal of these methods is to produce a set of output signals that are much more intelligible and listenable than the mixture signals, without any prior information about the signals being separated.
For several years, source separation has been a particularly active research topic. This interest is explained by the wide spectrum of possible applications, including telecommunications, acoustics, seismology, biomedical signal processing, the separation of stereophonic music extracts, speech enhancement for mobile telephony, the localization and tracking of targets in radar and sonar, the separation of speakers (the so-called "cocktail party problem"), detection and separation in multiple-access communication systems, etc. The blind approach to separation, in which we are interested, also offers the advantage of not requiring strong assumptions about the mixture: besides its overall structure, often assumed linear, no parameters are assumed known.
During recent decades, much attention has been given to the separation of mixed sources, in particular for the blind case where both the sources and the mixing process are unknown and only recordings of the mixtures are available. In several situations, it is desirable to recover all sources from the recorded mixtures, or at least to segregate a particular source. Furthermore, it may be useful to identify the mixing process itself to reveal information about the physical mixing system.
The objective of Blind Source Separation (BSS) is to extract the original source signals from their mixtures, and possibly to estimate the unknown mixing channel, using only the information in the observed signals with no, or very limited, knowledge about the sources and the mixing channel. Methods for this problem can be divided into methods using second-order or higher-order statistics, the maximum likelihood principle, the Kullback-Leibler distance, PCA methods, non-linear PCA, and ICA methods. Further information on these methods and some applications of ICA can be found in the literature. Most approaches to BSS assume the sources are statistically independent and thus often seek solutions of separation criteria using higher-order statistical information, or using only second-order statistical information in cases where the sources have temporal coherency, are non-stationary, or are cyclo-stationary. We must note that second-order methods do not actually replace higher-order ones, since each approach is based on different assumptions. For example, second-order methods assume that the sources are temporally colored, whereas higher-order methods assume white sources. Another difference is that higher-order methods do not apply to Gaussian signals, whereas second-order methods have no such constraint.
The problem was originally formulated by the Herault-Jutten model under the hypothesis of an instantaneous mixture. The generality of the problem has sustained interest in it, and many different approaches have been proposed in recent years. Later, it turned out that the instantaneous model does not suit all situations encountered in practice, so more realistic models have been proposed. One such model interprets the phenomenon of propagation as a filtering operation; that is, it assumes the environment is characterized by a time-dependent mathematical function and produces a more complex operation, a convolution product, generating so-called convolutive mixtures.
This paper is organized as follows: in Section 2, we present the three types of convolutive mixture models used in our experiments. In Section 3, we present the separation model. In Section 4, we give brief descriptions of the C-FICA and DUET methods and describe the implementation of the OPCA method that we propose for the separation of mixed speech signals. Section 5 presents the evaluation criteria. Section 6 presents and discusses the experimental results; finally, we conclude and outline perspectives for future work.
2 Mixing models
We record N conversations simultaneously with N microphones; each recording is a superposition of all conversations. The problem is to isolate each speech signal to understand what was said.
With the discrete time index t, a set of M source signals s(t) = (s1(t), ..., sM(t)) is received at an array of N sensors. The received signals are denoted x(t) = (x1(t), ..., xN(t)). In many real-world applications the sources are said to be convolutively (or dynamically) mixed. The convolutive model relates the n'th mixed signal to the original source signals; the real convolutive mixing process (including delays) can be written as:

xn(t) = Σm Σk amnk sm(t − k),  n = 1, ..., N    (1)
The mixed signal is a linear mixture of filtered versions of each of the source signals, and amnk represents the corresponding mixing filter coefficients. In practice, these coefficients may also change in time, but for simplicity the mixing model is often assumed stationary. In matrix form, the convolutive model can be written as:

x(t) = Σk Ak s(t − k)    (2)
where Ak is an N × M matrix that contains the k'th filter coefficients. The convolutive mixing process in Eq. (2) can be simplified by transforming the mixtures into the frequency domain. The linear convolution in the time domain can be written in the frequency domain as a separate multiplication for each frequency:

X(f) = A(f) S(f)
At each frequency f, A(f) is a complex N × M matrix, X(f) is a complex N × 1 vector, and similarly S(f) is a complex M × 1 vector. The frequency transformation is typically computed using a discrete Fourier transform (DFT) within a time frame of size T starting at time t:

X(f, t) = Στ x(t + τ) e−j2πfτ/T,  τ = 0, ..., T − 1
and correspondingly for S(f, t). Often a windowed discrete Fourier transform is used:

X(f, t) = Στ w(τ) x(t + τ) e−j2πfτ/T,  τ = 0, ..., T − 1
where the window function w(τ) is chosen to minimize band overlap. By using the fast Fourier transform (FFT), convolutions can be implemented efficiently in the discrete Fourier domain.
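As an illustrative sketch (not the authors' code; the function names are ours), the time-domain convolutive mixing model and the windowed DFT described above could be implemented as follows:

```python
import numpy as np

def convolutive_mix(sources, filters):
    """Time-domain convolutive mixing: each sensor observes a sum of
    filtered sources, x_n(t) = sum_m sum_k a[m,n,k] * s_m(t - k).

    sources: (M, T) array, one row per source signal
    filters: (M, N, K) array of FIR mixing coefficients a_mnk
    Returns an (N, T) array of sensor signals.
    """
    M, T = sources.shape
    _, N, K = filters.shape
    x = np.zeros((N, T))
    for n in range(N):
        for m in range(M):
            # full convolution truncated to the original length T
            x[n] += np.convolve(sources[m], filters[m, n])[:T]
    return x

def windowed_dft(signal, frame_len=512, hop=256):
    """Windowed DFT (STFT): one complex spectrum X(f, t) per frame.
    A Hann window w(tau) is applied to reduce band overlap."""
    w = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([w * signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # rows = frequency bins, columns = time frames
    return np.fft.rfft(frames, axis=1).T
```

Applying `windowed_dft` to each sensor signal yields the per-bin vectors X(f, t) on which the frequency-domain separation operates.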
In our experiments, we used three types of convolutive mixture. The first uses HRTF (Head Related Transfer Function) filters, the second is an anechoic mixture, and the third is a convolutive mixture using an array of microphones with a random complex mixing matrix.
2-1 HRTF mixing model
The perception of the acoustic environment, or room effect, is a complex phenomenon linked mainly to the multiple reflections, attenuation, diffraction and scattering that the acoustic wave undergoes, on the constituent elements of the physical environment around the sound source, in its propagation from source to ear. The acoustic path from a source to each ear thus acts as a spatial filter carrying the perceptual cues that govern sound localization. These filters contain information related to the phenomena of diffraction, scattering and reflection that a sound wave sustains during its travel between its source and the entrance to the ear canal of the listener. They are commonly called Head Related Transfer Functions, or HRTFs. The principle of measuring an HRTF is to place microphones in the ears and record the signals corresponding to different source positions; the HRTF is the transfer function between the source signal and the signal at the ear. The HRTF is thus treated as a linear, time-invariant system, and each HRTF is represented by a causal, stable FIR (Finite Impulse Response) filter.
2-2 Anechoic mixing model
To model a recording of an auditory scene made using microphones in a room whose walls are assumed to form an anechoic chamber, we use the following mixture model:

xm(t) = Σn amn sn(t − δmn),  m = 1, ..., M    (6)
The only difference from the instantaneous model is the introduction of the time delay δmn, the interval between the emission of sound by source n and its capture by microphone m. This delay can be characterized by the difference in arrival times between microphones. The factor amn is a relative attenuation corresponding to the path between source n and microphone m. Eq. (6) of the anechoic mixture can be written approximately in the Short-Time Fourier Transform domain as:

X(f, t) ≈ A(f) S(f, t)
where A(f) = [a1(f), ..., aN(f)] is the M × N matrix whose column vectors are called the source directions. Each column of A(f) is a complex vector. The anechoic mixing model is not realistic in that it does not represent echoes, that is, multiple paths from each source to each mixture.
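A minimal sketch of the anechoic model, restricted for simplicity to non-negative integer sample delays (the experiments in this paper also use negative relative delays), could look like this (function name is ours):

```python
import numpy as np

def anechoic_mix(sources, attens, delays):
    """Anechoic mixing: microphone m receives each source n scaled by
    a relative attenuation a_mn and delayed by d_mn samples.

    sources: (N, T) source signals
    attens:  (M, N) attenuation factors a_mn
    delays:  (M, N) non-negative integer sample delays d_mn
    Returns an (M, T) array of microphone signals.
    """
    N, T = sources.shape
    M = attens.shape[0]
    x = np.zeros((M, T))
    for m in range(M):
        for n in range(N):
            d = delays[m, n]
            # shift source n by d samples, scale, and add at mic m
            x[m, d:] += attens[m, n] * sources[n, :T - d]
    return x
```

Unlike the general convolutive case, each source contributes a single delayed, scaled copy per microphone; there are no multipath echoes.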
2-3 Microphone array with complex matrix mixing
Assume an array of p microphones (sensors) and denote by x(t) the received p-dimensional signal vector at instant t. Denote also by q the number of signals impinging on the array. A common model for the received signal vector is:

x(t) = Σi a(θi) si(t) = A s(t),  i = 1, ..., q

where:
- si(·) = scalar complex waveform referred to as the i'th signal
- A = [a(θ1), a(θ2), ..., a(θq)], where each a(θi) is a p × 1 complex vector parameterized by an unknown parameter vector θi associated with the i'th signal. We assume that the q (q < p) signals s1(·), ..., sq(·) are complex (analytic).
Many problems may be formulated using this simple linear model. These problems differ in the structure of the mixing matrix A, in the assumed knowledge about the unknown parameters, or in the statistical modeling.
3 The separation model
The objective of BSS is to find an estimate, ŝ(t), which is a model of the original source signals s(t). For this, it may not be necessary to identify the mixing filters Ak explicitly. Instead, it is often sufficient to estimate separation filters W that remove the cross-talk introduced by the mixing process (Fig. 1).
The goal in source separation is not necessarily to recover identical copies of the original sources. Instead, the aim is to recover the model sources without interference from the other sources: each separated signal ŝn(t) should contain signals originating from a single source only. Each model source signal may therefore be a filtered version of an original source signal.
Figure 1: Source separation system
The criterion for separation is satisfied if the recovered signals are permuted, and possibly scaled and filtered, versions of the original signals:

W(f) A(f) = P(f) D(f)
where P(f) is a permutation matrix and D(f) is a diagonal matrix with scaling filters on its diagonal. If one can identify A(f) exactly and choose W(f) to be its inverse, then D(f) is an identity matrix, and one recovers the sources exactly.
A survey of frequency-domain BSS is available in the literature. An advantage of blind source separation in the frequency domain is that, in addition to significant gains in computational efficiency, the separation problem decomposes into a smaller problem for each frequency bin: the convolutive mixture problem is reduced to "instantaneous" mixtures at each frequency. A problem that arises in the frequency domain, however, is the permutation and scaling ambiguity: if the convolutive problem is treated as a separate problem at each frequency, the source signals in each frequency bin may be estimated with an arbitrary permutation and scaling.
4 Comparative studies of three methods of blind speech separation
4-1 C-FICA method
The C-FICA algorithm (Convolutive extension of Fast-ICA: Independent Component Analysis) is a time-domain fast fixed-point algorithm that realizes blind source separation of convolutive mixtures. This algorithm was proposed by Thomas et al. in 2006. It consists of time-domain extensions of the Fast-ICA algorithms developed by Hyvarinen and Oja for instantaneous mixtures.
This approach is based on a convolutive sphering process (or spatio-temporal sphering) that allows the classical Fast-ICA updates to iteratively extract the innovation processes of the sources in a deflation procedure. For the estimation of the source contributions, the authors use a least-squares criterion optimized by a Wiener filtering process. The C-FICA algorithm offers various parameters and options: one can choose, for instance, the orders of the extraction and recoloration filters, the non-Gaussianity criterion, or the windows on which the extraction and recoloration filters are estimated. The main output is the signal vector S whose rows are the estimated contributions of the different sources (the most powerful contribution for each considered source). Other outputs, such as the estimated filters or the innovation processes, are also available. The algorithm ideally works with MA (Moving Average) mixtures of MA sources, since its validity is proved by mapping the mixtures into linear instantaneous ones. Nevertheless, with real speech sources, notable separation results are obtained by choosing the extraction window adequately.
4-2 DUET method
DUET (Degenerate Unmixing Estimation Technique) is a method that applies when the sources are W-disjoint orthogonal, that is, when the time-frequency representations of any two signals in the mixtures are disjoint sets. The method uses an online algorithm to perform a gradient search for the mixing parameters and simultaneously constructs binary time-frequency masks that are used to partition one of the mixtures to recover the original source signals. Jourjine et al. show that, for anechoic mixtures of attenuated and delayed sources, the method allows one to estimate the mixing parameters by clustering ratios of the time-frequency representations of the mixtures. The estimates of the mixing parameters are then used to partition the time-frequency representation of one mixture to recover the original sources. The key observation is that, for W-disjoint orthogonal sources, the mixing at a given time-frequency point is a function of only one source: for the anechoic mixing model, the ratio of the windowed Fourier transforms of the two mixtures at a given time-frequency point depends only on the mixing attenuation and delay parameters associated with one source. Clustering these ratio estimates reveals the mixing parameters.
The authors thus assume that the sources are W-disjoint orthogonal in the time-frequency domain associated with a transform using a weighting window W. This strong assumption means that the sources do not overlap in the time-frequency domain. The problem is to determine the time-frequency areas (t, f) where each of the N sources is present. For this, the authors use the W-disjoint orthogonality assumption and build an amplitude/phase histogram of all the attenuation and delay values encountered. This histogram shows peaks around the values corresponding to the parameters of the mixing matrix. Recognizing the limitation of W-disjoint orthogonality, the authors have shown that this property can be slightly relaxed by considering an approximate version, especially suitable for speech. The technique is valid even when the number of sources is equal to or larger than the number of mixtures.
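The per-bin attenuation/delay estimation at the heart of DUET can be sketched as follows (a simplified illustration with our own names; the full method additionally builds the 2-D histogram and the binary masks):

```python
import numpy as np

def duet_features(X1, X2, freqs, eps=1e-12):
    """Per-bin attenuation and delay estimates from the mixture ratio.

    X1, X2: STFTs of the two mixtures, shape (freq bins, frames)
    freqs:  angular frequency of each bin (must be nonzero)
    Under W-disjoint orthogonality each (f, t) point is dominated by a
    single source, so |X2/X1| estimates that source's attenuation and
    the phase of X2/X1, divided by -f, estimates its delay.
    """
    ratio = X2 / (X1 + eps)
    alpha = np.abs(ratio)                      # attenuation estimates
    delta = -np.angle(ratio) / freqs[:, None]  # delay estimates
    return alpha, delta

# Clustering the (alpha, delta) pairs in a 2-D histogram reveals one
# peak per source; bins nearest a peak form that source's binary mask.
```

Note that the phase-based delay estimate is unambiguous only while |f·delta| stays below pi, which is why DUET assumes closely spaced microphones.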
4-3 OPCA method
The OPCA (Oriented Principal Component Analysis) algorithm was previously proposed by Diamantaras and Papadimitriou for separating random signals, specifically four multilevel PAM (Pulse Amplitude Modulation) signals filtered by an ARMA (Auto-Regressive Moving Average) coloring filter. The authors show that OPCA, in combination with almost arbitrary temporal filtering, can be used for the blind separation of linear instantaneous mixtures.
In this work we use OPCA to perform BSS on a convolutive mixture of speech signals according to the model illustrated in Fig. 2. OPCA can be considered a generalization of PCA: it corresponds to the generalized eigenvalue decomposition of a pair of covariance matrices in the same way that PCA corresponds to the eigenvalue decomposition of a single covariance matrix. Oriented PCA involves two signals u(k) and v(k); the aim is to identify the so-called oriented principal directions e1, ..., en that maximize the signal-to-signal power ratio E[(eiT u)2] / E[(eiT v)2] under the orthogonality constraint eiT Ru ej = 0, i ≠ j. OPCA is a second-order statistics method, which reduces to standard PCA if the second signal is spatially white (Rv = I). The solution of OPCA, as shown in Figure 2, is a generalized eigenvalue decomposition of the matrix pencil [Ru, Rv].
Figure 2: Block diagram of BSS for convolutive mixtures using the OPCA method in the frequency domain.
Subsequently, we relate the BSS problem to the OPCA analysis of the observed signal X and an almost arbitrarily filtered version of it. Note that the 0-lag covariance matrix of X(f, t) is:

RX(0) = E[X(f, t) X(f, t)T] = A RS(0) AT
Now, consider a scalar linear filter h = [h0, ..., hM] (referred to as the J-filter in Fig. 2) operating on X(f, t):

Y(f, t) = Σk hk X(f, t − k),  k = 0, ..., M
The 0-lag covariance matrix of Y is expressed as:

RY(0) = E[Y(f, t) Y(f, t)T]
From the mixing model (Eq. (1)) it follows that:

RY(0) = A RS̄(0) AT

where S̄(f, t) = Σk hk S(f, t − k) is the J-filtered source vector, whose 0-lag covariance RS̄(0) is diagonal.
Provided that A is square and invertible, we can write:

RY(0) A−T = RX(0) A−T D,  D = RS(0)−1 RS̄(0)    (18)
Eq. (18) expresses a Generalized Eigenvalue Decomposition (GED) problem for the matrix pencil [RY(0), RX(0)]. This is equivalent to the OPCA problem for the pair of signals [Y(f, t), X(f, t)]. The generalized eigenvalues are the diagonal elements of D, and the columns of the matrix A−T are the generalized eigenvectors. The eigenvectors are unique up to permutation and scale provided that the eigenvalues are distinct (which is generically true). In this case, for any generalized eigenmatrix W we have W = A−T P, with P a scaled permutation matrix: each row and each column contains exactly one non-zero element. The sources can then be estimated as:

Ŝ(f, t) = WT X(f, t)
which, at each frequency bin, can be written as:

Ŝ(f, t) = W(f) X(f, t)
where Ŝ(f, t) = [Ŝ1(f, t), Ŝ2(f, t)]T is the estimated source signal vector and W(f) is the unmixing matrix at frequency bin f. The unmixing matrix W(f) is determined so that Ŝ1(f, t) and Ŝ2(f, t) become mutually uncorrelated, because the source signals S1(f, t) and S2(f, t) are assumed zero-mean and mutually uncorrelated. The estimated sources are equal to the true ones except for the (unobservable) arbitrary order and scale. We then apply the IFFT to Ŝ(f, t) to recover the estimated signals in the time domain.
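To make the per-bin procedure concrete, here is a minimal NumPy sketch (our illustration, not the authors' code; names are ours) of OPCA separation at one bin via the generalized eigenvalue decomposition of [RY(0), RX(0)]:

```python
import numpy as np

def opca_unmix(X, Y):
    """OPCA separation sketch for one frequency bin.

    X: (n, T) observed mixture frames at this bin
    Y: (n, T) J-filtered version of X
    Solves the generalized eigenvalue problem R_Y e = lambda R_X e;
    the generalized eigenvectors form the unmixing matrix W, and the
    sources are estimated as W^T X (up to permutation and scaling).
    """
    T = X.shape[1]
    Rx = (X @ X.conj().T) / T   # 0-lag covariance R_X(0)
    Ry = (Y @ Y.conj().T) / T   # 0-lag covariance R_Y(0)
    # GED of the pencil [R_Y, R_X] as the ordinary eigenproblem
    # of inv(R_X) @ R_Y
    _, W = np.linalg.eig(np.linalg.solve(Rx, Ry))
    return W.T @ X
```

For real-valued test signals the eigenvectors come out real; fully complex spectra would use the conjugate transpose throughout. Separation succeeds when the J-filter gives each source a distinct generalized eigenvalue (power ratio).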
The J-filter mentioned in Figure 2 is expressed as:
where α and β are parameters to be fixed. These parameters are optimized by re-formulating the D matrix of Eq. (18) as follows:
Note that the optimality criterion of the J-filter is related to the eigenvalue spread. The maximization criterion used to find α and β is given by:
where di,j represents the diagonal elements of D. In our experiments, a J-filter of order 3 was chosen. The search for the optimal filter is transformed into the search for the filter that spreads the eigenvalues as much as possible. The search is exhaustive, performed over values of α and β within a given interval (α, β ∈ [hmin, hmax]). In the experiments we fixed hmin = −5 and hmax = 5, with a step of 0.2.
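The exhaustive search over α and β can be sketched as follows (our illustration: the 3-tap filter [1, α, β] and the max/min eigenvalue ratio are assumptions standing in for the paper's exact J-filter parameterization and spread criterion):

```python
import numpy as np

def jfilter_search(X, h_min=-5.0, h_max=5.0, step=0.2):
    """Grid search for J-filter parameters (alpha, beta).

    X: (n, T) mixture signals.
    For each candidate filter h = [1, alpha, beta], the filtered
    signal Y is formed, the generalized eigenvalues of the pencil
    [R_Y, R_X] are computed, and the (alpha, beta) pair maximizing
    the eigenvalue spread is kept.
    """
    n, T = X.shape
    Rx = (X @ X.T) / T
    grid = np.arange(h_min, h_max + step / 2, step)
    best, best_spread = None, -np.inf
    for alpha in grid:
        for beta in grid:
            h = np.array([1.0, alpha, beta])   # candidate filter taps
            Y = np.stack([np.convolve(x, h)[:T] for x in X])
            Ry = (Y @ Y.T) / T
            lam = np.abs(np.linalg.eigvals(np.linalg.solve(Rx, Ry)))
            if lam.min() < 1e-12:
                continue
            spread = lam.max() / lam.min()     # eigenvalue spread
            if spread > best_spread:
                best_spread, best = spread, (alpha, beta)
    return best
```

With the paper's settings (hmin = −5, hmax = 5, step 0.2) this evaluates a 51 × 51 grid, which is cheap for small n.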
5 The Evaluation of Speech Quality
5-1 The Perceptual Evaluation of Speech Quality
To measure speech quality, one of the more reliable methods is the Perceptual Evaluation of Speech Quality (PESQ), standardized in ITU-T recommendation P.862. PESQ provides an objective, automated method for speech quality assessment. As illustrated in Fig. 3, the measure is computed by an algorithm that compares a reference speech sample to the same sample after processing by a system. The results can be mapped to mean opinion scores (MOS) based on the degradation of the sample; the PESQ algorithm is designed to predict subjective opinion scores of a degraded speech sample. PESQ returns a score from 0.5 to 4.5, with higher scores indicating better quality. For our experiments we used the code provided by Loizou. This technique is generally used to evaluate speech enhancement systems: usually, the reference signal is an original (clean) signal, and the degraded signal is the same utterance pronounced by the same speaker but subjected to diverse adverse conditions. In the PESQ algorithm, the reference and degraded signals are level-equalized to a standard listening level in a preprocessing stage, since the gain of the two signals is not known a priori and may vary considerably. In the original PESQ algorithm, the gains of the reference, degraded and corrected signals are computed from the root mean square values of band-pass-filtered (350-3250 Hz) speech. In our scaled version, the signals are normalized and the full frequency band is kept; the filter with a response similar to that of a telephone handset, present in the original PESQ algorithm, is also removed. The PESQ method is used throughout our experiments to evaluate the OPCA-estimated speech. PESQ has the advantage of being independent of the listeners and their number.
Figure 3: Block diagram of the PESQ measure computation.
5-2 The signal-to-interference ratio
The Signal-to-Interference Ratio (SIR) has been highlighted in the literature as one of the most effective criteria for methods aiming at reducing the effects of interference. The SIR is an important quantity in communications engineering that indicates the quality of a speech signal between a transmitter and a receiver. It is selected here as the optimization criterion.
The received signal quality is typically measured by the SIR, the ratio of the power of the wanted signal to the total residual power of the unwanted signals. The SIR is measured in dB, and values over 20 dB are considered good. It is scale- and permutation-invariant and can be seen as measuring the correlation between the matched true and estimated signals.
The principle of the performance measure is to decompose a given estimate ŝi(n) of a source si(n) as the sum:

ŝi(n) = starget(n) + einterf(n)    (26)
where starget(n) is an allowed deformation of the target source si(n) and einterf(n) is an allowed deformation of the sources that accounts for the interference of the unwanted sources. Given such a decomposition, one can compute the performance criterion provided by the Signal-to-Interference Ratio (SIR):

SIR = 10 log10 ( ||starget||2 / ||einterf||2 )
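A minimal sketch of this SIR computation, taking the allowed deformation of the target to be its orthogonal projection onto the true source (one common choice; the function name is ours):

```python
import numpy as np

def sir_db(estimate, target):
    """Signal-to-Interference Ratio (dB) of an estimate vs its target.

    The estimate is decomposed as s_target + e_interf, where s_target
    is the orthogonal projection of the estimate onto the true source
    (an allowed scaling of it) and e_interf is the remainder.
    """
    target = target / np.linalg.norm(target)
    s_target = np.dot(estimate, target) * target
    e_interf = estimate - s_target
    return 10 * np.log10(np.sum(s_target**2) / np.sum(e_interf**2))
```

Because the projection absorbs any scaling of the source, the measure is scale-invariant, and matching each estimate to its best source makes it permutation-invariant as well.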
6 Experiments and results
To evaluate our approach in the convolutive case, we compared it with the well-known C-FICA and DUET techniques using the three types of convolutive mixture. Through this comparison, we aim to demonstrate the effectiveness of the proposed separation technique based on the OPCA method. The approach is effective, as can be seen in the time domain: the original signals (Fig. 5) and the signals estimated by OPCA (Fig. 7-(a)) are very close. The OPCA method has the advantage that no pre-processing step is necessary. The method was also verified subjectively by listening to the original, mixed and separated signals; we obtained a very good separation.
In the following experiments, the source and observation data dimension is n = 2, and the 16 kHz speech signals were taken from the TIMIT database and looped so that each signal has a duration of six seconds. The TIMIT corpus contains broadband recordings of 6300 sentences: 630 speakers from 8 major dialect regions of the United States, each reading 10 phonetically rich sentences. Some sentences of the TIMIT database were chosen to evaluate our BSS methods. We tested OPCA using a filter of order 3, as mentioned earlier; the use of more correlation matrices increases the information input to the estimation process and thus improves the separation quality. We consider a two-input, two-output convolutive BSS problem, so we convolutively mixed two speech signals, s1(n) and s2(n), pronounced by a man and a woman respectively.
In the experiment with the HRTF model, a dummy head with two microphones (one in each ear) was used instead of the microphone array. This kind of recording was used to investigate how effective the BSS is in a more natural configuration of the sources. This situation takes into account all the changes in the acoustic field connected with the head, i.e., the Head Related Transfer Function. The HRTF influences both the sound pressure level and the spectra of the source signals reaching the ears. We tested our overall framework with mixing filters measured at the ears of a dummy head, selecting impulse responses associated with source positions at 30- and -80-degree angles relative to the dummy head, as shown in Fig. 4.
Figure 4: The convolutive (HRTF) model with source positions at 30 and -80-degree angles in relation to the dummy head.
For the anechoic model, the mixtures contained the two sources with relative amplitudes (1.1, 0.9) and sample delays (-2, 2); for the third mixture, with a microphone array, the complex mixing matrix was chosen randomly.
Figure 5 shows the two original speech signals, male and female, recorded from the TIMIT database. Figure 6 shows the signals mixed by convolution with the three types of mixing model, and Figure 7 shows the estimated signals after applying the OPCA, C-FICA and DUET approaches.
Figure 5: (a): Original signals: Male sentence: "This brings us to the question of accreditation of art schools in general", (b): female sentence: "She had your dark suit in greasy wash water all year".
Figure 6: Signals mixed by convolution with: (a) HRTF model, (b) Anechoic model and (c) Microphone Array.
Figure 7: Estimated signals by: (a) the OPCA method, (b) the DUET method, (c) the C-FICA method.
In the PESQ evaluation, OPCA was the best method in comparison with the C-FICA and DUET approaches, for all three types of mixing, as can be seen in Table 1. Note that for anechoic mixing, DUET gives a good PESQ score, but not as good as that of the OPCA method.
Table 1: Comparison of PESQ for the C-FICA, DUET and the OPCA methods
The input Signal-to-Interference Ratio (SIRin) before separation was -4.66 dB for the male speech and -1.75 dB for the female speech. In Table 2, SIRout represents the output Signal-to-Interference Ratio after separation, where starget in Eq. (26) is the original speech signal and the interference einterf is the difference between the original and estimated speech signals.
As shown in Table 2, the SIRout of the OPCA method is larger than that of the C-FICA and DUET approaches. Taking the HRTF mixing model as an example, the improvement in SIRout for C-FICA was 8.36 dB and 4.81 dB for male and female speech respectively; for DUET it was 3.81 dB and 1.66 dB; and for the OPCA approach it reached 23 dB and 29.31 dB.
Table 2: Comparison of SIR for the C-FICA, DUET and OPCA methods.
In frequency-domain algorithms, the challenge is to solve the permutation ambiguity, i.e., to make the permutation matrix P(f) independent of frequency. Especially when the number of sources and sensors is large, recovering consistent permutations is a severe problem: with N model sources there are N! possible permutations in each frequency bin. Many frequency-domain algorithms provide ad hoc solutions that solve the permutation ambiguity only partially, thus requiring a combination of different methods. The problem is not very severe in our case, because we work with only two sources.
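One common family of fixes, sketched below purely for illustration (it is not the method used in this paper), aligns the permutation in each bin by correlating amplitude envelopes across frequency:

```python
import numpy as np
from itertools import permutations

def align_permutations(S):
    """Align per-bin permutations by envelope correlation.

    S: (F, N, T) separated spectra -- F frequency bins, N sources,
    T frames -- each bin separated up to an unknown permutation.
    Greedily permutes each bin so that its amplitude envelopes
    correlate best with the running mean envelope of the bins
    already aligned.
    """
    F, N, T = S.shape
    env = np.abs(S)
    ref = env[0].copy()                      # envelopes of the first bin
    for f in range(1, F):
        best_p, best_score = None, -np.inf
        for p in permutations(range(N)):
            score = sum(np.corrcoef(ref[i], env[f, p[i]])[0, 1]
                        for i in range(N))
            if score > best_score:
                best_score, best_p = score, p
        S[f] = S[f, list(best_p)]
        env[f] = env[f, list(best_p)]
        ref = ref + (env[f] - ref) / (f + 1)  # running mean envelope
    return S
```

This exploits the fact that a speaker's energy envelope is similar across frequencies; the N! inner loop is cheap for the two-source case considered here.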
We have presented a blind speech separation technique for convolutive mixtures using an Oriented Principal Component Analysis method. Earlier approaches have consistently used two steps: a pre-processing (sphering) step followed by a second-order analysis method such as PCA. The OPCA approach has the advantage that no pre-processing step is required, since sphering is implicitly incorporated in the signal-to-signal power ratio criterion optimized by OPCA.
The proposed technique for separating mixed observations into source estimates is effective, as shown in the time domain. Subjective evaluation was performed by listening to the signals before mixing, after mixing and after separation. The results are very satisfactory: we obtained a very good separation. We also tested the method with other speech signals from the TIMIT, Noizeus and AURORA databases, and the results were similar. These results confirm the efficiency in the convolutive case of the OPCA method, which we previously applied for the first time to the separation of speech signals in the instantaneous mixing case. We are continuing our research by combining OPCA with other methods to resolve the permutation ambiguity and by applying it in a mobile communication framework.