Voice Activity Detection Computer Science Essay


Voice activity detection (VAD) is a speech processing technique in which the presence or absence of human speech is detected.

The main uses of VAD are in speech coding and speech recognition. It can facilitate speech processing, and it can also be used to deactivate some processes during the non-speech sections of an audio session [1].

It can avoid unnecessary coding/transmission of silence packets in Voice over Internet Protocol applications, saving on computation and on network bandwidth.

VAD is an important enabling technology for a variety of speech-based applications. Therefore, various VAD algorithms have been developed, offering different trade-offs between latency, sensitivity, accuracy and computational cost. Some VAD algorithms also provide further analysis, for example whether the speech is voiced, unvoiced or sustained. Voice activity detection is usually language independent.

Applications of VAD

Traditional voice-based communication uses the Public Switched Telephone Network (PSTN) [3]. Such systems are expensive when the distance between the calling and called subscribers is large, because a dedicated connection must be held for the call. The current trend is to provide this service over data networks, which work on best-effort delivery and share resources through statistical multiplexing.

Therefore, the cost of the service is considerably lower than on circuit-switched networks. However, these networks do not guarantee faithful voice transmission: voice over packet, or Voice over IP (VoIP), systems have to ensure that voice quality does not significantly deteriorate due to network conditions such as packet loss and delay.

Providing toll-grade voice quality through VoIP systems therefore remains a challenge. In this paper we concentrate on the problem of reducing the bandwidth required for a voice connection over the Internet using Voice Activity Detection (VAD), while maintaining the voice quality.

Fig1. Block Diagram for Voice Activity Detection in Mobile Communication

Voice Activity Detection in VOIP

Voice over Internet Protocol (VOIP) is a general term for a family of transmission technologies for delivery of voice communications over IP networks such as the Internet or other packet-switched networks. Other terms frequently encountered and synonymous with VOIP are IP telephony, Internet telephony, voice over broadband (VoBB), broadband telephony, and broadband phone. Internet telephony refers to communications services - voice, facsimile, and/or voice-messaging applications - that are transported via the Internet, rather than the public switched telephone network (PSTN). The basic steps involved in originating an Internet telephone call are conversion of the analog voice signal to digital format and compression/translation of the signal into Internet protocol (IP) packets for transmission over the Internet; the process is reversed at the receiving end[2].

VOIP systems employ session control protocols to control the set-up and tear-down of calls as well as audio codecs which encode speech allowing transmission over an IP network as digital audio via an audio stream. Codec use is varied between different implementations of VOIP (and often a range of codecs are used); some implementations rely on narrowband and compressed speech, while others support high fidelity stereo codecs.

In VoIP systems the voice data (the packet payload) is transmitted along with a header over the network. The header size for the Real-time Transport Protocol (RTP) is 12 bytes [2]. The ratio of header size to payload size is an important factor in selecting the payload size for better throughput from the network. A smaller payload gives better real-time quality but decreases throughput.

Conversely, a larger payload gives higher throughput but performs poorly in real time. A constant-size payload representing a segment of speech is referred to as a 'frame' in this paper, and its size is determined by the above considerations. If a frame does not contain a voice signal, it need not be transmitted; the VAD for VoIP has to determine whether a frame contains a voice signal. The decision made by VAD algorithms for VoIP is always on a frame-by-frame basis.

Chapter 2: VAD Algorithms

Various algorithms are used in voice activity detection. These algorithms are classified into different domains depending upon the parameters and processes involved. The two important techniques being used currently are:

Time Domain Techniques

Frequency Domain Techniques

Each of these has a different complexity and speech quality, and results vary with the parameters used in both techniques.

Before going deeper into these techniques, the following terms should be clarified.

2.1 Speech Characteristics

Different types of sounds, ranging from consonants to vowels, are present in conventional speech. A pattern of silence and such sounds collectively forms speech. When a signal containing speech is transmitted through a channel, it becomes contaminated with noise. A VAD processes this signal and, based on certain classification criteria, differentiates between speech and silence in the presence of noise. Thus, by detecting and rejecting silence segments, available bandwidth is saved.

Fig 2: Speech Characteristics

2.2 Speech Bursts

Speech bursts are energy peaks in a signal; whenever speech is present in a signal, its energy value is higher than that of noise.

2.3 Silence Periods

In a speech signal there are different types of pauses. There are pauses between two words of a speech, which are hardly 100 ms long. Other pauses are silent packets that occur between speech activities or even between sentences. Such silent packets are the major concern of VAD.

Apart from these silent segments, due to the presence of noise there will be times when the energy of these silent segments is greater than zero. In those cases the VAD has to reject signals with energy levels below the threshold value.

2.4 Noise

Noise is an unwanted component present in a signal (analog or digital). Whenever a signal is sent through a channel, noise comes into the picture. Based on the number of sources, noise can be of two types:

Stationary Noise

Non Stationary Noise

Stationary Noise: noise coming from a single source.

Non-Stationary Noise: noise coming from multiple sources.

For example, a conference call in which two or more individuals are talking produces non-stationary noise.

Desirable aspects of VAD algorithms

A Good Decision Rule: a physical property of speech that can be exploited to give consistent judgment in classifying segments of the signal as silent or voiced [4].

Adaptability: the biggest challenge for a VAD is robustness to non-stationary noise. An efficient VAD should adjust its parameters according to variable environmental noise. This is especially helpful for WLAN applications where the user is mobile.

Low Computational Complexity: VAD being a real-time technique, it should use less complex calculations so that delay can be avoided.

Toll Quality Voice Reproduction

Bandwidth Saving

Parameters for VAD

A speech signal is separated into speech and silence according to the speech characteristics. The signal is divided into frames, and each frame has a non-negative parameter associated with it.

Time Domain Techniques measure the following parameters:

Energy Threshold Value

Zero Crossing Rate

Based on these parameters the frame is classified as ACTIVE or INACTIVE: if the energy value is above the threshold value, the frame is ACTIVE; otherwise it is INACTIVE and is discarded.

Choice of Frame Duration

ACTIVE frames that are transmitted are queued in a packet buffer at the receiver. This allows the receiver to keep playing audio even if incoming packets are delayed by network conditions.

Consider a VoIP system with a buffer of 3-4 packets. A frame duration of 10 ms allows the VoIP system to start playing the audio at the receiver's end 30 to 40 ms after the queue starts building up. If the frame duration were 50 ms, there would be an initial delay of 150-200 ms, which is unacceptable [4].

Therefore, the frame duration must be chosen carefully. Current VoIP systems use 5-40 ms frame sizes.
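The buffering arithmetic above is simply the number of buffered packets times the frame duration, as this small Python sketch illustrates (a hypothetical helper, not part of any VoIP stack):

```python
def initial_playout_delay_ms(buffer_packets, frame_ms):
    """Initial playout delay = number of buffered packets * frame duration."""
    return buffer_packets * frame_ms

# a 3-4 packet buffer: 10 ms frames vs. 50 ms frames
print(initial_playout_delay_ms(3, 10), initial_playout_delay_ms(4, 10))  # 30 40
print(initial_playout_delay_ms(3, 50), initial_playout_delay_ms(4, 50))  # 150 200
```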

Specifications for a VAD on the basis of toll quality are:

8kHz sampling frequency

8 bit PCM

Single Channel Recording

ITU G.711 coding.

Energy of a frame

The energy of a frame is an important parameter, as it reflects whether or not voice is present in that frame.

Let X(i) be the ith sample, with k samples per frame. Then the jth frame is

Frame_j = { X((j−1)k + 1), X((j−1)k + 2), ..., X(jk) }

and the mean energy associated with the jth frame, E_j, is given by the following expression:

E_j = (1/k) · Σ_{i = (j−1)k+1 .. jk} X(i)²
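The frame-energy computation above can be sketched in a few lines of Python (the experiments in this report use MATLAB; this is only an illustrative stand-in, with `frame_energies` a hypothetical helper name):

```python
def frame_energies(x, k):
    """Split signal x into consecutive k-sample frames and return the
    mean energy E_j = (1/k) * sum of X(i)^2 over each frame."""
    n_frames = len(x) // k  # any trailing partial frame is dropped
    return [sum(s * s for s in x[j * k:(j + 1) * k]) / k
            for j in range(n_frames)]

# a silent frame followed by a full-scale "speech" frame
signal = [0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 1.0, -1.0]
print(frame_energies(signal, 4))  # [0.0, 1.0]
```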
Initial Value of Threshold

The initial value of the threshold is important for its subsequent evolution, since the threshold tracks the noise. A poorly chosen initial value leads to a poor VAD estimate [5].

The classical method of finding the initial value of the threshold is as follows. The VAD algorithm uses a prerecorded sample that contains only background noise; the initial estimate of the energy threshold is obtained by taking the mean of the energies of the m noise-only frames:

E_r(0) = (1/m) · (E_1 + E_2 + ... + E_m)

The initial threshold for the variance of the spectrum is given analogously by the mean of the spectral variances of the noise-only frames:

σ_T(0) = (1/m) · (σ_1 + σ_2 + ... + σ_m)
Energy of a frame is a reasonable parameter on the basis of which frames may be classified as ACTIVE or INACTIVE; the energy of ACTIVE frames is higher than that of INACTIVE frames.

The rule states that a frame is ACTIVE if

E_j > k · E_r

where E_j is the energy of the jth frame and k · E_r is the energy threshold; k is a scaling factor that provides a safety band for threshold adaptation.
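A minimal Python sketch of this decision rule; the scaling factor k = 2 used below is an assumed illustrative value, since the actual choice is a tuning decision:

```python
def classify_frame(E_j, E_r, k=2.0):
    """ACTIVE when the frame energy E_j exceeds the scaled threshold
    k * E_r; k = 2 is an assumed illustrative scaling factor."""
    return "ACTIVE" if E_j > k * E_r else "INACTIVE"

print(classify_frame(0.9, 0.1))   # ACTIVE (0.9 > 0.2)
print(classify_frame(0.15, 0.1))  # INACTIVE (0.15 < 0.2)
```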

LED: Linear Energy-Based Detector

The E_r level calculated in Eqn (5) is for stationary noise, whereas in practical situations where the user is mobile, noise comes from different sources, building up non-stationary noise. In such conditions an adaptive threshold is more appropriate [3], given by the following relation:

E_r,new = (1 − p) · E_r,old + p · E_silence    ... (6)

The E_r threshold is updated as a convex combination of the old threshold and the energy E_silence of the latest INACTIVE (noise) frame.

p is chosen by considering the impulse response of equation 6, and usually lies in 0 < p < 1.

Taking the Z-transform of equation 6 gives the transfer function

H(z) = E_r(z) / E(z) = p / (1 − (1 − p) z⁻¹)

According to fig. 4, at p = 0.2 the fall time (i.e., the number of delayed packets that influence the energy threshold value) comes out to be 15 delay units. Since each packet is 10 ms long, p = 0.2 corresponds to 150 ms.

This means that 15 INACTIVE frames are needed to bring about a change in the threshold value. A normal pause between two words is around 100 ms, which should not be counted as a silence frame, so p = 0.2 (150 ms) is good enough to keep E_r from being updated by normal in-between pauses.
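The adaptive update of equation 6 is plain exponential smoothing. The Python sketch below (a stand-in for the MATLAB experiments) runs the update for 15 INACTIVE frames, mirroring the 150 ms fall time discussed above:

```python
def update_threshold(E_r, E_frame, p=0.2):
    """Convex combination of the old threshold and the latest
    INACTIVE frame's energy (equation 6)."""
    return (1 - p) * E_r + p * E_frame

# threshold falling toward a quieter 0.1 noise floor over
# 15 INACTIVE frames (~150 ms of 10 ms packets at p = 0.2)
E_r = 1.0
for _ in range(15):
    E_r = update_threshold(E_r, 0.1)
print(round(E_r, 3))  # 0.132, i.e. most of the way down to 0.1
```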

Fig 3. P vs Fall Time.

Advantages: This algorithm is simple to implement and gives acceptable speech quality at good compression.

Shortcomings:

Under varying background noise, the E_r of equation 6 is incapable of keeping pace with rapidly changing background noise, leading to undesirable speech clipping at the beginning and end of speech bursts.

Words like 'flower' and 'high' (non-plosive phonemes) were completely clipped, as LED depends entirely on the energy threshold.

Undue clipping may occur in signals with lower SNRs, which deteriorates performance.

Fig 4: Fall times for different value of p

Adaptive Linear Energy-Based Detector (ALED)

In LED we used equation 6 to calculate the varying energy threshold, which was unable to cope with varying background noise. In ALED we use second-order statistics to calculate E_r. The receiver maintains a buffer of silence-frame energies, E_silence: whenever a new silence packet arrives it is added to the queue and the oldest one is removed. The variance of the buffer is given by:

σ² = (1/m) · Σ_{j=1..m} (E_j − Ē)²

where Ē is the mean energy of the m buffered frames.

A change in the background noise is detected by comparing the energy of the new INACTIVE frame with a statistical measure of the energies of the past m INACTIVE frames. Consider the instant a new INACTIVE frame is added to the noise buffer. The variance just before the addition is denoted by σold.

After the addition of the new INACTIVE frame, the variance is σnew.

If there is a change in the background noise, then

σnew ≠ σold

Depending on the ratio of σnew to σold, the value of p is decided on the basis of table 2.1.

Unlike in LED, E_r now changes rapidly with changes in background noise, and hence ACTIVE and INACTIVE voice can be classified using the same energy threshold method described in equation 5.
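The ALED idea can be sketched as follows: compare the noise-buffer variance before and after adding the new INACTIVE frame, and pick p from the ratio. The breakpoints below are illustrative placeholders for the mapping in table 2.1:

```python
def variance(xs):
    """Population variance of a list of frame energies."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def choose_p(noise_buf, new_energy):
    """Pick the adaptation rate p from the ratio of the buffer variance
    after vs. before adding the new INACTIVE frame's energy.
    The breakpoints are illustrative stand-ins for table 2.1."""
    var_old = variance(noise_buf)
    var_new = variance(noise_buf[1:] + [new_energy])  # drop oldest, add newest
    ratio = var_new / var_old if var_old > 0 else float("inf")
    if ratio > 1.5:        # background noise changed sharply
        return 0.25
    if ratio > 1.25:
        return 0.20
    return 0.10            # stationary background

buf = [0.10, 0.12, 0.11, 0.13]
print(choose_p(buf, 0.12))  # 0.1  (stationary noise)
print(choose_p(buf, 0.50))  # 0.25 (noise jump)
```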

Improvement: E_r (the energy threshold) can rapidly track varying background noise.

Short Comings:

Words like 'flower' and 'high' (non-plosive phonemes) were still completely clipped.

Undue clipping may occur in signals with lower SNRs, which deteriorates performance.

Weak Fricative Detectors (WFD)

The Above two Time domain techniques are both energy based where the signal energy was being compared to energy thresholds. The only short coming in which low energy signals were clipped can be resolved using WFD.

WFD works on the principle that Zero Crossing rate for voice signals lie in a certain value only.[5-15]. For noise the zero crossing is unpredictable. So unlike LED and ALED here the low energy phonemes can be clearly detected.

The zero-crossing count for each frame is calculated as:

N_zcs = (1/2) · Σ_i |sgn(X(i)) − sgn(X(i−1))|

and a frame is treated as speech if N_zcs ∈ R. N_zcs is the number of zero crossings detected in a frame, and R is the set of values {5, 6, 7, ..., 15}, the range of zero-crossing counts for 10 ms speech frames.

This detector is incorporated into ALED: the Zero Crossing Detector (ZCD) checks the voice activity of the frames that were declared INACTIVE by ALED [4]. Thus, ZCD recovers almost all the low-energy speech phonemes that would otherwise be silenced.

A ZCD often makes incorrect decisions, however, as noise frames may have the same number of zero crossings as speech frames.
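The WFD decision can be sketched directly from the definitions of N_zcs and R (hypothetical Python helpers, not the MATLAB code used later in this report):

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)

def is_speech_like(frame, zc_range=range(5, 16)):
    """WFD rule: a 10 ms frame whose zero-crossing count falls in
    R = {5, ..., 15} is treated as (possibly low-energy) speech."""
    return zero_crossings(frame) in zc_range

# a low-amplitude alternating frame: 7 sign changes, so speech-like
frame = [0.01, -0.01, 0.01, -0.01, 0.01, -0.01, 0.01, -0.01]
print(zero_crossings(frame), is_speech_like(frame))  # 7 True
```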

Chapter 3: Frequency Domain Characteristics

Frequency Domain Techniques measure the following parameters:

Sub-Band Energy

Spectral Flatness (variance of the spectrum)

Depending upon these parameters, Frequency Domain Techniques use the following algorithms:

LSED: Linear Sub-Band Energy Detector

SFD: Spectral Flatness Detector

CVAD: Comprehensive VAD.

For Spectrum Computation we use Discrete Cosine Transform.


Discrete Fourier Transform:

This transform, part of Fourier analysis, is used to convert a function from the time domain to the frequency domain. The difference between the Fourier Transform and the DFT is that, unlike the FT, the DFT takes a finite sequence of discrete values as input. Such signals can be produced by sampling a continuous signal [3].

DFT formula:

A complex sequence x_0, x_1, ..., x_{N−1} is transformed using

X_k = Σ_{n=0..N−1} x_n · e^(−j·2πkn/N),   k = 0, 1, ..., N−1

Discrete Cosine Transform

The DCT is used to represent a finite sequence of data points as a sum of cosine functions at different frequencies. The DCT is preferred over the DST (Discrete Sine Transform) because cosines are more efficient than sines, i.e. fewer terms are needed to approximate a typical signal.

The difference between the DFT and the DCT is that the DFT works on complex numbers whereas the DCT works on real values.

DCT formula:

A real sequence x_0, x_1, ..., x_{N−1} is transformed using

X_k = Σ_{n=0..N−1} x_n · cos[(π/N)(n + 1/2)k],   k = 0, 1, ..., N−1

There are 8 different types of DCT, but DCT-II is the one mostly used in signal and image processing; DCT-II is often simply called 'the DCT'.
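A naive Python implementation of the DCT-II formula above can serve as a sanity check; a constant signal puts all of its energy into the k = 0 coefficient:

```python
import math

def dct_ii(x):
    """Naive DCT-II: X_k = sum_n x_n * cos((pi/N) * (n + 1/2) * k)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k)
                for n in range(N))
            for k in range(N)]

X = dct_ii([1.0, 1.0, 1.0, 1.0])
print(round(X[0], 6), max(abs(c) for c in X[1:]) < 1e-9)  # 4.0 True
```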

The main reasons why the DCT is preferred over the DFT in frequency-domain techniques are as follows:

The DCT offers less complexity than the DFT.

It is a real-valued transform, using real values compared to the complex values in the DFT.

Now let us talk about the frequency domain techniques.

3.1 Linear Sub Band Energy Detector

This algorithm is similar to the LED algorithm of the time domain. In this method the signal spectrum is divided into four distinct frequency bands with the following widths:

0-1 kHz

1-2 kHz

2-3 kHz

3-4 kHz

The decision is based on calculating the energy of each frame and comparing it with a reference; in this case the reference is an energy threshold value in the frequency domain.

The frequency counterpart of the frame can be found via the DCT:

X_k = Σ_{n=0..N−1} x_n · cos[(π/N)(n + 1/2)k]

For the nth band, we have

E_n = Σ_{k ∈ band n} X_k²

Now the condition for the presence of speech in each band is:

E_n > k · T_n

where T_n is the energy threshold of the nth band.
The threshold for each band needs to be updated as we compare the energy levels, so that the system keeps up with every increase in the noise level [5]. To update the energy threshold we use the following block diagram.

Fig 5: Energy Threshold Update

So the updated energy equation is:

T_n,new = (1 − p) · T_n,old + p · E_n

The new energy thresholds are computed recursively, but for each band separately, as a convex combination.
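The sub-band energy computation can be sketched as follows, assuming an 8 kHz sampling rate and four equal DCT bands; a low-frequency, voice-like frame concentrates its energy in band 0:

```python
import math

def dct_ii(x):
    """Naive DCT-II spectrum of one frame."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k)
                for n in range(N))
            for k in range(N)]

def subband_energies(frame, n_bands=4):
    """Sum X_k^2 over four equal-width bands of the DCT spectrum
    (0-1, 1-2, 2-3 and 3-4 kHz at an 8 kHz sampling rate)."""
    X = dct_ii(frame)
    width = len(X) // n_bands
    return [sum(c * c for c in X[b * width:(b + 1) * width])
            for b in range(n_bands)]

N = 80  # a 10 ms frame at 8 kHz
# a slowly varying, voice-like frame (aligned with DCT basis k = 4):
# its energy should land entirely in band 0
frame = [math.cos(math.pi / N * (n + 0.5) * 4) for n in range(N)]
E = subband_energies(frame)
print(E.index(max(E)))  # 0
```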

Fig 6: Block Diagram for LSED

Fraction of Energy in Lowest Frequency Band

As seen in the flowchart above, the 0-1 kHz spectral band is not compared against the other three. The reason is that most of the energy of a voice signal remains in the lowest frequency band (0-1 kHz) [7], whereas the higher frequency bands contain mostly noise and perhaps a few nasal sounds. So by comparing the lower band with the higher bands we can decide whether the frame being processed is ACTIVE or INACTIVE.

Rule of Speech

A frame is declared ACTIVE if the lowest frequency band is active and at least two of the other bands are also active.

If the lower band is inactive, the frame is INACTIVE regardless of the higher frequency bands [6]. If the lower frequency band is active but two of the higher frequency bands are inactive, the frame likewise remains INACTIVE.

This method faces the same problem as the other algorithms: poor performance on low-SNR signals. Low-energy phonemes cannot be detected; diphthongs in particular are difficult to detect.

However, if this method is combined with Weak Fricative Detection (time domain), its performance on low-SNR signals can be drastically improved.

Spectral Flatness Detector

This algorithm works on the basic principle of power spectral density.

3.2.1 Power Spectral Density:

In analogy with energy signals, let us define a function S_f(ω) that gives an indication of the relative power contributions at various frequencies [3]. This function has units of power per Hz, and its integral yields the power in f(t); it is known as the power spectral density function:

P = (1/2π) ∫ S_f(ω) dω
Fig 7: PSD for Noise

The figure above shows the PSD of periodic random noise and of white noise together (generated in MATLAB).

SFD differentiates between voice and noise using the spectral density. White noise has a flat PSD, whereas voiced signals have a non-stationary spectral density with more spectral content at lower frequencies. To differentiate between speech and noise content we can use the variance method (as in the time domain):

σ² = (1/N) · Σ_k (|X_k| − mean|X|)²

where the X_k are the spectral coefficients of the frame. High variance indicates speech content, whereas low variance implies noise alone.

Now, similarly to the other algorithms, we compare the variance of each frame with a threshold variance. If it is more than the threshold, that frame is ACTIVE; otherwise the INACTIVE frame is used to update the threshold value.

For the nth frame:

IF σ_n² > σ_T → ACTIVE frame

ELSE → INACTIVE frame

The threshold is updated during silence as a convex combination, using the following equation:

σ_T,new = (1 − p) · σ_T,old + p · σ_n²
This algorithm is more efficient than LSED when signals have lower SNRs, because it uses a statistical measure of the energy distribution.
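The spectral-variance measure can be illustrated with a deterministic toy example in Python: a tone aligned with one DCT basis vector has a maximally peaky spectrum, while an equal-energy impulse smears its energy across the spectrum (a crude stand-in for flat-spectrum white noise):

```python
import math

def dct_ii(x):
    """Naive DCT-II of a real sequence."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k)
                for n in range(N))
            for k in range(N)]

def spectral_variance(frame):
    """Variance of the DCT magnitude spectrum: high for a peaky
    (speech-like) spectrum, low for a flat (noise-like) one."""
    X = [abs(c) for c in dct_ii(frame)]
    mean = sum(X) / len(X)
    return sum((m - mean) ** 2 for m in X) / len(X)

N = 64
# a pure tone aligned with DCT basis k = 8: all energy in one bin
tone = [math.cos(math.pi / N * (n + 0.5) * 8) for n in range(N)]
# an equal-energy impulse: energy smeared across the whole spectrum
impulse = [math.sqrt(N / 2)] + [0.0] * (N - 1)
print(spectral_variance(tone) > spectral_variance(impulse))  # True
```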

CVAD: Comprehensive VAD

As seen earlier, each algorithm exploits only a few characteristics of speech. In order to obtain better reconstructed quality we take all the methods into consideration and incorporate them into one algorithm. This algorithm identifies white noise and frequency-selective noise while maintaining excellent speech quality. The parameter calculations remain the same; only the decision rule changes, based on the priority of the algorithms, with the energy method given the highest priority. The flowchart is as follows.

Fig8: CVAD Flow Chart

To understand the flow chart above, consider one frame at a time.

Each frame is passed through the multi-band energy comparator, where its energy is compared to a threshold value. If the energy of the frame is high enough, the frame is marked ACTIVE.

If this fails, the signal is passed through a zero-crossing detector so that sounds with lower energy values, like diphthongs, can be detected.

The ZCD sometimes passes noise that has the same zero-crossing count as voice signals.

In that case the signal is passed through the SFD, which differentiates between white noise and voice.

If the signal fails all of these tests, it is an INACTIVE frame and is used to update the threshold values.

Summarizing, we conclude that a CVAD takes into account (in priority order):

Energy Detection

Zero Crossing Detection

Spectral Flatness Detection


This process gives better results than the other algorithms.
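The priority-ordered decision described above can be condensed into a small sketch, where the three boolean inputs stand for the outcomes of the energy, ZC and SFD tests (a hypothetical helper, not the actual implementation):

```python
def cvad_decision(energy_active, zc_active, sfd_active):
    """Priority-ordered decision sketch of the CVAD flow chart:
    the multi-band energy test first, then zero crossings,
    then spectral flatness as a noise check on ZCD hits."""
    if energy_active:              # high-energy frame: clearly speech
        return "ACTIVE"
    if zc_active and sfd_active:   # low-energy phoneme confirmed by SFD
        return "ACTIVE"
    return "INACTIVE"              # fails all tests: used to update thresholds

print(cvad_decision(True, False, False))   # ACTIVE
print(cvad_decision(False, True, False))   # INACTIVE (ZC noise rejected)
```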


The only demerit of this algorithm is that, even though it is a complex process computing various parameters, its performance at lower SNR is still not satisfactory.

Chapter 4: Comparison of Time and Frequency Domain


All the algorithms were implemented in MATLAB and run on various signals. These signals varied in background noise, loudness, speech continuity and accent. A study was done on the basis of the following parameters.

Percentage Compression: a measure of how efficiently a VAD system detects and eliminates INACTIVE frames. It is given by the formula:

% Compression = (number of INACTIVE frames eliminated / total number of frames) × 100

Subjective Speech Quality: all the speech samples were rated on a scale of 5 (1 = poor, 5 = best, 4 = toll-grade quality).

Floating Point Operations Required: a measure of an algorithm's performance in terms of the floating-point calculations it uses, analogous to the earlier instructions-per-second measure. This helps in comparing the applicability of the algorithms to real-time implementation.

Objective Assessment of Misdetection

Every erroneous decision in which the VAD classifies a voice-containing frame as INACTIVE or a noise-containing frame as ACTIVE is a misdetection, which should be very rare for an efficient VAD.

The misdetection percentage is given by:

% Misdetection = (SEF / total number of frames) × 100

where SEF is the sum of erroneous INACTIVE and ACTIVE frames.
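The misdetection figure is straightforward to compute (a hypothetical Python helper for illustration):

```python
def misdetection_percent(sef, total_frames):
    """SEF = sum of erroneously classified INACTIVE and ACTIVE frames."""
    return 100.0 * sef / total_frames

print(misdetection_percent(12, 400))  # 3.0
```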

MOS (Mean Opinion Score) is the appropriate method of scoring the speech quality.

So, after studying all the algorithms and the important parameters, we can state the definition of an efficient VAD algorithm.

Definition: An efficient VAD should maintain acceptable speech quality with high compression and a low number of FLOPS.

The figures given below illustrate all six algorithms with respect to percentage compression, speech quality, number of FLOPS and misdetection.

A normalization was applied across the algorithms, since the number of FLOPS differs between them and is substantially higher for CVAD; the data is therefore normalized and scaled to 100 [5].

We have taken three different speech signals


Discontinuous Monologue with low-energy phonemes.

Rapidly spoken accented monologue.


Discontinuous Monologue with low energy phonemes

Rapidly Spoken Accented Monologue.

Trends observed in implementation and testing:

The lowest number of FLOPS was observed for the time-domain algorithms, as they involve less complicated and easier calculations than the frequency-domain algorithms.

Energy-dependent algorithms failed to give good speech quality, whereas spectral flatness and ZCD gave better speech quality.

Low-energy phonemes that were rejected by the energy detector were detected by zero-crossing detection. However, noise with the same zero-crossing count as voice was also picked up.

Except for the last two algorithms, all were affected by low-SNR signals.

Spectral flatness detection is the most efficient at detecting speech in low-SNR signals.

4.5 In Time Domain Algorithms:

LED is less complex, with lower calculation requirements; however, its quality is low compared to the other algorithms.

This problem can be reduced easily using ALED, but at the cost of reduced signal compression and a higher number of FLOPS.

WFD follows the same trend but gives better quality than the other time-domain algorithms.

4.6 In Frequency Domain Algorithms:

LSED gives better speech quality than its analogous time-domain algorithm, LED, but fails to perform on low-SNR signals.

SFD works well with low-SNR signals as it depends on the power spectral density.

CVAD provides excellent speech quality compared to the other frequency-domain algorithms, but its performance in low-SNR conditions is not satisfactory even though it uses more computation (increased FLOPS).


5.1 Computation of ZCR

% Computation of ZCR of a speech signal.
% Author: Ashish Sharma
% Date: 2010/05/10

[x,Fs] = wavread('sample.wav');  % read samples x and sampling rate Fs
x = x.';
N = length(x);                   % signal length in samples
n = 0:N-1;
ts = n*(1/Fs);                   % time axis for the signal

% define the analysis window
wintype = 'rectwin';
winlen = 201;
winamp = [0.5,1]*(1/winlen);

% find the zero-crossing rate (zerocross is a custom helper function)
zc = zerocross(x,wintype,winamp(1),winlen);

% time index for the ST-ZCR after delay compensation
out = (winlen-1)/2:(N+winlen-1)-(winlen-1)/2;
t = (out-(winlen-1)/2)*(1/Fs);

plot(ts,x); hold on;
plot(t,zc(out),'r','LineWidth',2);
xlabel('t, seconds'); ylabel('ZC');
title('Zero Crossing Rate');


5.2 Functions Used:

wavread: returns the sample data and the sampling rate (Hz) used to encode the data.

Ex: [x,Fs] = wavread('so.wav'); returns the data in x and the sampling rate in Fs.

length(): returns the length of the signal in samples.

rectwin: returns a rectangular window of the given length as a column vector.

plot(x,y): plots the signal on an X vs. Y graph.

A custom function, zerocross, calculates the zero-crossing rate. Its parameters are:

wintype: type of the window.

winlen: length of the window.

winamp: amplitude of the window.

Two graphs are plotted at once: one represents the signal and the other the zero-crossing rate.


Given below is the plot of zero crossings vs. time.

Fig 9: ZCR Vs Time

From the figure it can be seen that at the start of the word 'so' there is noise, which gives a high zero-crossing count in those frames, whereas in the region where the word itself starts there are far fewer zero crossings.

Therefore, as stated in the earlier sections, the ZC-based algorithms, WFD (time domain) and CVAD (frequency domain), can easily detect the voice signal in noise.

However, as we can see near the ends, where no voice signal is present, the zero-crossing count is the same as that of the voice signal, and even this has to be taken into account by the VAD.


In this a normal Energy based detector is shown which calculates the energy of a signal.

Components used:

Audio Device: takes the audio input from the microphone.

Vector Scope: plots the signal spectrum.

Matrix Square: squares the magnitude of the audio.

Energy Scope: plots the energy graph.

Audio Output: outputs the audio after the processing delay.

Display: displays the energy magnitude.

5.5 Computation of Energy Threshold

% Computation of Energy of a speech signal.
% Author: Ashish Sharma
% Date: 2010/05/10

[x,Fs] = wavread('so.wav');      % word is: so
x = x.';
N = length(x);                   % signal length in samples
n = 0:N-1;
ts = n*(1/Fs);                   % time axis for the signal

% define the analysis window
wintype = 'rectwin';
winlen = 201;
winamp = [0.5,1]*(1/winlen);

% find the short-time energy (energy is a custom helper function)
E = energy(x,wintype,winamp(2),winlen);

% time index for the energy after delay compensation
out = (winlen-1)/2:(N+winlen-1)-(winlen-1)/2;
t = (out-(winlen-1)/2)*(1/Fs);

plot(ts,x); hold on;
plot(t,E(out),'r','LineWidth',2);
xlabel('t, seconds');



5.6 Functions Used:

wavread: returns the sample data and the sampling rate (Hz) used to encode the data.

Ex: [x,Fs] = wavread('so.wav'); returns the data in x and the sampling rate in Fs.

length(): returns the length of the signal in samples.

rectwin: returns a rectangular window of the given length as a column vector.

plot(x,y): plots the signal on an X vs. Y graph.

A custom function, energy, calculates the energy of the signal. Its parameters are:

wintype: type of the window.

winlen: length of the window.

winamp: amplitude of the window.

Two graphs are plotted at once: one represents the signal and the other the energy.

5.7 Energy Plot

Given below is the plot of energy vs. time.

Fig 10: Energy Vs Time

From the figure it can be seen that in the presence of noise alone the energy is low, whereas in the region where the word starts the energy detected is high.

Therefore, as stated in the earlier sections, the energy-based algorithms, LED and ALED (time domain), can easily detect the voice signal in noise.

However, as we can see, low-level voice segments have the same energy level as the noise, which is problematic when detecting voice in noise.


This is an example of a simple zero-crossing detector.

Components used:

Audio Device: takes the audio input from the microphone.

Zero Crossing Detector: returns the zero-crossing count.

ZC Scope: plots the ZC graph.

Display: displays the ZC counts.

Chapter 6: Conclusion

Until the mid-semester we studied and implemented time-domain techniques, where we examined parameters such as the energy threshold and the zero-crossing rate.

Time Domain Algorithms:

Linear Energy Based Detector

Adaptive Linear Energy-Based Detector

Weak Fricative Detectors

In the second half we studied frequency-domain algorithms, which involve parameters such as sub-band energy and spectral flatness.

Frequency Domain Algorithms:

Linear Sub-Band Energy Detector

Spectral Flatness Detector

Comprehensive VAD

The spectrum-based detector is efficient for signals with lower SNR values, and the VAD that combines energy, spectrum and zero-crossing detection is the most effective of all the above algorithms.

With the help of these algorithms, services like VoIP and audio messaging have come into existence; though not yet hugely successful, they have proved to be of great help and very cost-effective, especially for people with heavy international call usage. However, owing to limited research and advancement of these algorithms, the technology is not yet widely popular, although companies like Skype have put it to use and are serving the market well.

The time-domain algorithms are less complex than the frequency-domain algorithms, but the speech quality and VAD performance are better when frequency-domain algorithms are used. Improvements still remain to be made in CVAD so that it can provide high VAD efficiency even with low-SNR signals.