BSMRNN For Speech Synthesis Using Haar Wavelet Computer Science Essay


ABSTRACT- The Bidirectional Segmented-Memory Recurrent Neural Network (BSMRNN) is an architecture in which the process of memorization is segmented. We test the performance of BSMRNN on the information latching problem, the "two-sequence problem," and the problem of speech synthesis. We also theoretically analyze how the segmented memory of BSMRNN helps in learning long-term temporal dependencies, and study the impact of the segment length in speech synthesis. Text files are converted to wavelet files and segmented as contextual information at every processing step; the segment lengths optimize the memorization process. Wavelets can keep track of both time and frequency information: they can be used to "zoom in" on short bursts, or to "zoom out" to detect long, slow oscillations. The first module performs text-to-speech conversion using Haar wavelet analysis; the Haar representation and a number of related representations derived from it are suitable for direct comparison. The second module segments the recognized wavelets and feeds them as input to the BSMRNN. The main objective is to reduce the number of epochs necessary for convergence and to increase speech processing speed.

Index Terms- Wavelets, information latching, long-term dependencies, recurrent neural networks (RNNs), segmented memory, speech synthesis, Haar wavelet analysis.


I. INTRODUCTION

THE standard structural framework of recurrent neural networks does not adequately model long-term dependencies; many researchers have reported this limitation [1]. The transformation of the MS-Windows interface to spatial audio has been proposed by Crispien [6] and Sodnik [7], both transforming the hierarchical navigation scheme into a "ring" metaphor. The necessary conditions for robust information latching result in the problem of vanishing gradients, making the task of learning long-term dependencies difficult. Several approaches have been suggested to circumvent this problem. Some consider alternative network architectures, such as second-order recurrent neural networks, nonlinear autoregressive models with exogenous inputs (NARX) recurrent neural networks [5], [7], hierarchical recurrent neural networks, and long short-term memory (LSTM). In order to tackle the long-term dependency problems in speech processing, we propose a novel recurrent architecture named the Bidirectional Segmented-Memory Recurrent Neural Network (BSMRNN) [6] and develop a learning strategy to construct it. We first theoretically analyze the behavior of the SMRNN and test its performance in speech synthesis [6]. Furthermore, we carry out experiments on artificially generated sequential processing tasks and real-world problems. Both our theoretical and experimental results indicate that SMRNN improves performance on problems that involve long-term dependencies. Some preliminary results of our model have been reported in [4] and [5]. In pervasive services using mobile devices, the computational cost of sensor signal processing and recognition tasks such as speech detection should be kept as low as possible [9].

This paper is organized as follows. Section II states our motivation for proposing BSMRNN. Section III presents a detailed description of BSMRNN. Section IV discusses the system design and how speech synthesis is applied in BSMRNN. Section V explains the implementation using screen shots. Section VI gives concluding remarks.


II. MOTIVATION

As we observe, when people memorize long numbers or long sentences, they tend to do so in segments: for instance, ten digits at a time, or subject-predicate-object in a sentence. During the memorization of a long sequence, people tend to break it into a few segments, memorize each segment first, and then cascade the segments to form the final sequence. The process of memorizing a sequence in segments is illustrated in Fig. 1, where the substrings in parentheses represent segments of lengths d1, d2, and d3; gray arrows indicate the update of contextual information associated with memory symbols

Fig. 1. Illustration of segmented memorization

and black arrows indicate the update of contextual information associated with segments; numbers under the arrows indicate the sequencing of memorization. The lengths d1, d2, and d3 are not necessarily equal to one another; the segment length can be fixed or vary from segment to segment.

BSMRNN is not the first method that involves sequence decomposition. Schmidhuber's hierarchical chunker system uses the principle of history compression to compress sequences. For problems such as the latching problem and the "two-sequence" problem, which do not have local regularities, the chunker fails to capture long time-lagged dependencies.


III. BIDIRECTIONAL SEGMENTED-MEMORY RECURRENT NEURAL NETWORK

Based on this observation of human memorization, we believe that RNNs are more capable of capturing long-term dependencies if they have segmented memory and imitate the human way of memorization. Following this intuitive idea, we propose a novel recurrent network named the segmented-memory recurrent neural network (abbreviated as SMRNN).

Fig. 2.a. Segmented-memory recurrent neural network

A. Architecture of BSMRNN

The architecture of SMRNN is illustrated in Fig. 2.a. The SMRNN has two hidden layers, namely, hidden layer H1 and hidden layer H2. Layer H1 represents the symbol-level state and layer H2 represents the segment-level state. Layer H1 has a recurrent connection to itself: the symbol-level state stored in layer H1 for the previous symbols is fed back and stored in context layer S1. Layer H2 also has recurrent connections to itself: the segment-level state stored in layer H2 for the previous segments is fed back and stored in context layer S2. Context layers S1 and S2 store contextual information at the symbol level and at the segment level, respectively. Most importantly, we introduce into the network a new attribute, interval, to denote the segment length, which can be fixed or variable.

The architecture of BSMRNN is illustrated in Fig. 2.b. The BSMRNN uses a forward SMRNN and a backward SMRNN to capture upstream and downstream context, respectively. The final output is obtained from a feedforward subnetwork that combines the upstream context, the residue of interest, and the downstream context.
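For illustration, this combination step can be sketched as a single feedforward output layer applied to the concatenation of the three context vectors. The class and variable names below are our own, not taken from the implementation:

```java
// Sketch of the BSMRNN output stage: a feedforward layer over the
// concatenation of upstream context, the residue input, and downstream
// context. W has one row per output unit.
public class BsmrnnCombiner {
    static double[] combine(double[] up, double[] residue, double[] down,
                            double[][] W) {
        int n = up.length + residue.length + down.length;
        double[] in = new double[n];
        System.arraycopy(up, 0, in, 0, up.length);
        System.arraycopy(residue, 0, in, up.length, residue.length);
        System.arraycopy(down, 0, in, up.length + residue.length, down.length);
        double[] z = new double[W.length];
        for (int k = 0; k < W.length; k++) {
            double s = 0.0;
            for (int j = 0; j < n; j++) s += W[k][j] * in[j];
            z[k] = 1.0 / (1.0 + Math.exp(-s));   // sigmoid output unit
        }
        return z;
    }
}
```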


Fig. 2.b. Bidirectional segmented-memory recurrent neural network

B. Dynamics of SMRNN

In this section, we formulate the dynamics of SMRNN to implement the segmented memory illustrated in Fig. 1. The symbol-level state is initialized by x_k^0 = g(σ_k^0), where g is a sigmoid function and σ_k^0 is a parameter to be optimized during training. Suppose a sequence of symbols is fed to the SMRNN one symbol per cycle; the symbol-level state is then updated at the arrival of each symbol. At the beginning of each segment, the symbol-level state x_k^t is obtained from the initial symbol-level state x_k^0 and the input u^t; at other positions, it is obtained from the previous state x^(t-1) and the input u^t. Let u_i^t be the ith input at cycle t; the symbol-level state at cycle t is then calculated by

x_k^t = g( Σ_j W_kj^xx x_j^0 + Σ_i W_ki^xu u_i^t ),        t ∈ SH
x_k^t = g( Σ_j W_kj^xx x_j^(t-1) + Σ_i W_ki^xu u_i^t ),    otherwise

where segment head (SH) refers to the beginning of a segment. The segment-level state is initialized at the beginning of the sequence by y_k^0 = g(ύ_k^0), where ύ_k^0 is a parameter to be trained. Due to the insertion of intervals into the memory of context, the segment-level state is updated only at the end of each segment:

y_k^t = g( Σ_j W_kj^yy y_j^(t-1) + Σ_i W_ki^yx x_i^t ),    t ∈ ST
y_k^t = y_k^(t-1),                                          otherwise

where segment tail (ST) denotes the end of a segment.
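As a minimal sketch, this forward pass can be written in Java (the system's implementation language), assuming a fixed segment length d and using illustrative class, field, and method names of our own; the last, possibly shorter segment is treated as ending in a segment tail:

```java
import java.util.Random;

// Sketch of the SMRNN forward pass: symbol-level state x restarts from the
// learned initial state at each segment head (SH); segment-level state y is
// updated only at each segment tail (ST).
public class SmrnnSketch {
    final int nU, nX, nY, d;                  // input/symbol/segment sizes, segment length
    final double[][] Wxx, Wxu, Wyy, Wyx;      // weight matrices
    final double[] x0, y0;                    // learned initial states g(sigma0), g(upsilon0)
    double[] x, y;

    SmrnnSketch(int nU, int nX, int nY, int d, long seed) {
        this.nU = nU; this.nX = nX; this.nY = nY; this.d = d;
        Random r = new Random(seed);
        Wxx = rand(nX, nX, r); Wxu = rand(nX, nU, r);
        Wyy = rand(nY, nY, r); Wyx = rand(nY, nX, r);
        x0 = new double[nX]; y0 = new double[nY];
        for (int k = 0; k < nX; k++) x0[k] = g(0.1 * r.nextGaussian());
        for (int k = 0; k < nY; k++) y0[k] = g(0.1 * r.nextGaussian());
    }

    static double g(double s) { return 1.0 / (1.0 + Math.exp(-s)); }  // sigmoid

    static double[][] rand(int m, int n, Random r) {
        double[][] w = new double[m][n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) w[i][j] = 0.2 * r.nextGaussian();
        return w;
    }

    // Process one input sequence; returns the final segment-level state y.
    double[] process(double[][] seq) {
        x = x0.clone(); y = y0.clone();
        for (int t = 0; t < seq.length; t++) {
            boolean head = (t % d == 0);
            double[] prev = head ? x0 : x;        // restart symbol memory at SH
            double[] xNew = new double[nX];
            for (int k = 0; k < nX; k++) {
                double s = 0.0;
                for (int j = 0; j < nX; j++) s += Wxx[k][j] * prev[j];
                for (int i = 0; i < nU; i++) s += Wxu[k][i] * seq[t][i];
                xNew[k] = g(s);
            }
            x = xNew;
            boolean tail = (t % d == d - 1) || (t == seq.length - 1);
            if (tail) {                           // y updated only at ST
                double[] yNew = new double[nY];
                for (int k = 0; k < nY; k++) {
                    double s = 0.0;
                    for (int j = 0; j < nY; j++) s += Wyy[k][j] * y[j];
                    for (int i = 0; i < nX; i++) s += Wyx[k][i] * x[i];
                    yNew[k] = g(s);
                }
                y = yNew;
            }
        }
        return y;
    }
}
```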

Fig. 3. Dynamics of SMRNN.

The segment-level contextual information contained in the vector y^t is forwarded to the output layer to produce an output z_k^t = g( Σ_j W_kj^zy y_j^t ). The input is one-hot coded: when the lth symbol in the input symbol set is read, the lth element of the input vector u^t is 1 and all other elements are 0. Since the synaptic input equals the sum of inputs multiplied by weights, if an input is zero, the weights associated with that input unit are not updated during training.

C. BSMRNN learning strategy

The BSMRNN is trained using an extension of the real-time recurrent learning algorithm. In some early RNN architectures, the weights are initialized with random values and then trained, but the initial states of the hidden neurons x^0 and y^0 do not change during training. As remarked in [11], it is not reasonable to keep the initial states fixed, and the behavior of the network improves if they are also learned. In order to keep x_k^t and y_k^t within the range [0, 1] during gradient descent, we define the synaptic inputs

σ_k^t = Σ_{j=1}^{n_x} W_kj^xx x_j^(t-1) + Σ_{j=1}^{n_u} W_kj^xu u_j^t
ύ_k^t = Σ_{j=1}^{n_y} W_kj^yy y_j^(t-1) + Σ_{i=1}^{n_x} W_ki^yx x_i^t

such that x_k^t = g(σ_k^t) and y_k^t = g(ύ_k^t), and take σ_k^0 and ύ_k^0 as the parameters to be optimized.

The learning is based on minimizing the sum-of-squared-error cost function

E = (1/2) Σ_k ( z̄_k^t − z_k^t )^2

where z̄_k^t is the desired output and z_k^t is the actual output. Every parameter P, including W_kj^zy, W_kj^yy, W_ki^yx, W_kj^xx, W_kj^xu, σ_k^0, and ύ_k^0, is initialized with small random values and then updated according to gradient descent

∆P = −α (∂E/∂P) + η ∆'P

with a learning rate α and a momentum η, where ∆'P is the variation of P in the previous iteration.
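The momentum update above can be sketched as follows (class and variable names are our own; the gradient computation itself is not shown):

```java
// Gradient-descent update with momentum: for each parameter P,
// deltaP = -alpha * dE/dP + eta * deltaP', then P += deltaP, where
// deltaP' is the variation of P in the previous iteration.
public class MomentumUpdate {
    final double alpha, eta;      // learning rate and momentum
    final double[] prevDelta;     // Delta'P for every parameter

    MomentumUpdate(double alpha, double eta, int nParams) {
        this.alpha = alpha;
        this.eta = eta;
        this.prevDelta = new double[nParams];
    }

    // Apply one update step given the gradient dE/dP for each parameter.
    void step(double[] params, double[] grad) {
        for (int i = 0; i < params.length; i++) {
            double delta = -alpha * grad[i] + eta * prevDelta[i];
            params[i] += delta;
            prevDelta[i] = delta;     // becomes Delta'P in the next iteration
        }
    }
}
```

On a convex one-parameter cost such as E = P^2, repeated calls of `step` drive P toward the minimum at 0.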


For testing the performance of SMRNN, we choose three well-known benchmark problems: 1) the problem of phoneme recognition, which in turn is used for speech processing; 2) the information latching problem; and 3) the "two-sequence problem."

Fig. 4. System design: information latching problem, two-sequence problem, Haar wavelet analysis, speech synthesis, and output (wave/wavelets)

In Figure 4, the input is sent to the system and converted into speech; the resulting wave file is then fed to the BSMRNN to capture long-term dependencies.

System Architecture for Speech Synthesis

The system architecture consists of five major modules: four pieces of software (libraries and Java classes) and a standard sound card with hardware support for sound. The software part is based on the Java programming language with some additional external plug-ins. The five modules are:

• FreeTTS: a speech synthesis system written entirely in Java;
• JOAL: implementation of the Java bindings for the OpenAL API [16][17];

• HRTF library from MIT Media Lab (measurements of a KEMAR dummy head) [18];

• a custom made signal processing module and

• Creative Sound Blaster X-Fi Extreme Gamer sound card.
Fig. 5. Process of speech synthesis

a. FreeTTS

With FreeTTS included in a Java application, one has to define an instance of the class Voice. The selected voice is determined through the class called VoiceManager. Speech synthesis can then be done simply by calling the method speak (a method of the Voice class) with the text to be spoken as a parameter. By default, Java uses its default audio player for playback, which sends the synthesized sound directly to the default sound card. In our case, we wanted the output to be a buffer of sound samples for further processing, so we had to develop a custom audio player by implementing the Java class AudioPlayer. Our audio player, named BufferPlayer, outputs synthesized speech as an array of bytes. The output is an array of 16-bit samples with little-endian byte order. Each sound sample is represented as a signed integer in two's complement binary form (amplitude values between −2^15 and 2^15 − 1).
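Decoding such a byte array back into signed 16-bit samples can be sketched as follows (the class and method names are ours, not FreeTTS's):

```java
// Sketch: decode 16-bit signed little-endian samples (as produced by the
// BufferPlayer described above) from a byte array into short values.
public class SampleDecoder {
    static short[] toSamples(byte[] bytes) {
        short[] out = new short[bytes.length / 2];
        for (int i = 0; i < out.length; i++) {
            int lo = bytes[2 * i] & 0xFF;   // low byte first (little endian)
            int hi = bytes[2 * i + 1];      // high byte carries the sign
            out[i] = (short) ((hi << 8) | lo);
        }
        return out;
    }
}
```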

b. MIT Media Lab HRTF library

HRTFs (head-related transfer functions) are transfer functions of head-related impulse responses (HRIRs) that describe how a sound wave is filtered by diffraction and reflection off the torso, head, and pinna as it travels from the sound source to the human eardrum. These impulse responses are usually measured by inserting microphones in human ears or by using dummy heads. The measurements for different spatial positions are gathered in various databases or libraries and can be used as filters for creating and playing spatial sounds through headphones. A separate function has to be used for each individual ear and each spatial position.

The MIT Media Lab library, which was used in our system, contains HRTF measurements for 710 spatial positions: azimuths from −180° to 180° and elevations from −40° to 90°. The MIT library is available online in various formats and sizes, and it contains individualized as well as generalized functions (measured with a dummy head). We used the library in order to improve the elevation positioning of the sound sources.

Fig. 6. Elevation settings

The elevation localization depends strongly on individualized human factors: torso, shoulder, and pinna shape. Ideally, an individualized HRTF should be used for each user, but this is virtually impossible when the system is intended to be used by a large number of users. We therefore used the generalized compact measurements in wav format. Only 14 functions were used in our application: elevations from −40° to 90° at azimuth 0°. At azimuth 0°, the same function can be used for both ears. The azimuth positioning was done by the JOAL library.

c. Signal processing module

The MIT HRTF library contains functions in PCM format consisting of 16-bit samples at a 44.1 kHz sampling frequency. In order to be used with the output of FreeTTS, the samples had to be down-sampled to 16 kHz. In Java, the HRTFs were defined as arrays of bytes. The filtering of synthesized samples with an HRTF was done by calculating the convolution of the two arrays; to calculate the convolution correctly, both arrays had to be converted to arrays of float numbers. After convolution, the resulting array was converted to a 16-bit array of unsigned samples with big-endian byte order, appropriate as input to the JOAL library functions.
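The convolution step can be sketched as a direct time-domain implementation over float arrays (a sketch with our own names, not the system's actual code):

```java
// Sketch of the HRTF filtering step: full convolution of the synthesized
// sample array with an impulse response, both given as float arrays.
public class Convolver {
    static float[] convolve(float[] signal, float[] ir) {
        float[] out = new float[signal.length + ir.length - 1];
        for (int n = 0; n < out.length; n++) {
            float acc = 0f;
            for (int k = 0; k < ir.length; k++) {
                int i = n - k;                       // signal index for this tap
                if (i >= 0 && i < signal.length) acc += signal[i] * ir[k];
            }
            out[n] = acc;
        }
        return out;
    }
}
```

For long impulse responses an FFT-based convolution would be faster, but the direct form shown here is the simplest correct realization.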

d. JOAL library

JOAL is a reference implementation of the Java bindings for the OpenAL API. OpenAL is a cross-platform 3D audio API appropriate for use in various audio applications. The library models a collection of audio sources moving in a 3D space that are heard by a single listener located somewhere in that space. The positioning of the listener or the sources is done simply by defining their coordinates in a Cartesian coordinate system. The actual processing of input sounds is done in software (by the library itself) or in hardware (if provided by the sound card). JOAL provides functions for reading external wav files, which are then positioned and played: the samples from the wave files are written to special buffers, and the buffers are attached to sources with specific spatial characteristics. In our case, the input to JOAL was an array of samples (FreeTTS's output convolved with the HRTF), which was written directly to the buffers. By using only the JOAL library for spatial positioning, the spatial arrangement of the sources and the listener could be changed at any time. In our case, however, only changes of horizontal position can be performed dynamically through JOAL; vertical changes require preprocessing with a new HRTF.

e. Soundcard

A Creative Sound Blaster X-Fi Extreme Gamer was used within the system. The sound card has a special DSP unit called CMSS-3D, which offers hardware support for spatial sound generation. CMSS is another type of generalized HRTF library used for filtering input sounds. In general, the sound card can be configured for output to various speaker configurations (e.g., headphones, desktop speakers, 5.1 surround), but in our case the use of an additional HRTF library required playback through stereo headphones. The Creative sound card works well with the JOAL (OpenAL) positioning library; however, if no hardware support for spatial sound is found, the spatialization is performed by the library itself (with a certain degradation of quality).

Haar Wavelets for Speech Synthesis

A Haar wavelet is the simplest type of wavelet [9]. In discrete form, Haar wavelets are related to a mathematical operation called the Haar transform, which serves as a prototype for all other wavelet transforms. Like all wavelet transforms, the Haar transform decomposes a discrete signal into two subsignals of half its length: one subsignal is a running average, or trend; the other is a running difference, or fluctuation.
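One level of this decomposition can be sketched with plain pairwise averages and differences (an unnormalized variant; class and method names are ours, and an even input length is assumed):

```java
// One level of the Haar transform: the trend is the pairwise running
// average, the fluctuation the pairwise running difference.
public class HaarTransform {
    // Returns {trend, fluctuation}, each half the input length.
    static double[][] forward(double[] s) {
        int h = s.length / 2;
        double[] a = new double[h], d = new double[h];
        for (int i = 0; i < h; i++) {
            a[i] = (s[2 * i] + s[2 * i + 1]) / 2.0;   // running average (trend)
            d[i] = (s[2 * i] - s[2 * i + 1]) / 2.0;   // running difference (fluctuation)
        }
        return new double[][]{a, d};
    }

    // Exact inverse: s[2i] = a[i] + d[i], s[2i+1] = a[i] - d[i].
    static double[] inverse(double[] a, double[] d) {
        double[] s = new double[2 * a.length];
        for (int i = 0; i < a.length; i++) {
            s[2 * i] = a[i] + d[i];
            s[2 * i + 1] = a[i] - d[i];
        }
        return s;
    }
}
```

Separate output arrays are used here for clarity; the same arithmetic can be performed in place, which is what makes the transform memory efficient.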

The Haar wavelet transform has a number of advantages [9]:

• It is conceptually simple.

• It is fast.

• It is memory efficient, since it can be calculated in place without a temporary array.

• It is exactly reversible without the edge effects that are a problem with other wavelet transforms.

The Haar transform also has limitations [10], which can be a problem for some applications. In generating each set of averages for the next level and each set of coefficients, the Haar transform performs an average and a difference on a pair of values, then shifts over by two values and calculates another average and difference on the next pair. The high-frequency coefficient spectrum should reflect all high-frequency changes, but the Haar window is only two elements wide: if a big change takes place from an even value to an odd value, the change will not be reflected in the high-frequency coefficients. Audio de-noising by the Haar wavelet is therefore not always effective, because the transform cannot compress the energy of the original signal into a few high-energy values lying above the noise threshold.
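This two-element-window limitation can be demonstrated directly: a step change falling exactly on a pair boundary produces no nonzero fluctuation coefficients at all, while the same step inside a pair is captured. A self-contained sketch (names are ours):

```java
// Demo of the Haar window limitation: only the fluctuation (difference)
// coefficients are computed, d[i] = (s[2i] - s[2i+1]) / 2.
public class HaarWindowDemo {
    static double[] fluctuations(double[] s) {
        double[] d = new double[s.length / 2];
        for (int i = 0; i < d.length; i++)
            d[i] = (s[2 * i] - s[2 * i + 1]) / 2.0;
        return d;
    }
}
```

For the signal {0, 0, 1, 1} the step lies between the two pairs and every fluctuation coefficient is zero, whereas for {0, 1, 1, 1} the step lies inside the first pair and shows up as a nonzero coefficient.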

Fig. 7. Haar-like filter for sound signals

To apply Haar-like filtering to sound signals, we used a simple one-dimensional difference filter hm, as in Fig. 7, with variable filter width and filtering shift width. Controlling these degrees of freedom adapts the feature search space to a given recognition problem. Since this filter has coefficients of −1 or +1, the computation required for the filtering process is limited. The feature value used for discrimination is the sum of the absolute outputs of the Haar-like filtered signal; the actual calculation utilizes the integral-signal technique.

To control the degrees of freedom, maximum values for the filter width and the shift width are given. The shift width is controlled by the maximum shift rate α:

WShiftMax = α · WFilterMax

When α = 0, WShiftMax is set to 1. From initial experiments, we noticed that a different set of filters with different classification properties is selected for the speech/non-speech classification problem when the filters' degrees of freedom are altered. The Haar-like filter set yielding the minimum training error is selected, and the feature vector is formed by concatenating the outputs of the selected Haar-like filters. In this study, the classifier was kept to the conventional LBG vector quantizer [5] for a pure comparison of the extracted features' classification properties among various feature extraction methods. As noted in equation (1), the computation required in Haar-like filtering is very simple: the filter's coefficient in each of its slots is either +1 or −1, so no multiplication is needed in the feature extraction process.
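A minimal sketch of such a Haar-like difference filter with variable widths follows. The names and settings are illustrative (the paper's exact filter bank and its integral-signal optimization are not reproduced); each window sums one slot with coefficient −1 and one with +1, so only additions and subtractions are needed:

```java
// Haar-like difference filtering: a two-slot filter (coefficients -1 then +1,
// each slot `width` samples wide) is slid along the signal in steps of
// `shift`; the feature is the sum of the absolute filter outputs.
public class HaarLikeFilter {
    static double feature(double[] s, int width, int shift) {
        double sum = 0.0;
        for (int t = 0; t + 2 * width <= s.length; t += shift) {
            double out = 0.0;
            for (int i = 0; i < width; i++)
                out += s[t + width + i] - s[t + i];   // +1 slot minus -1 slot
            sum += Math.abs(out);
        }
        return sum;
    }

    // Maximum shift width from the shift rate: WShiftMax = alpha * WFilterMax,
    // with WShiftMax set to 1 when alpha = 0.
    static int maxShift(double alpha, int wFilterMax) {
        int w = (int) Math.round(alpha * wFilterMax);
        return Math.max(w, 1);
    }
}
```

On a constant signal every window output cancels and the feature is zero; any local change in level yields a positive contribution.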

Fig. 8. Converted wavelet for speech

C. The Information Latching Problem

The standard system-theory approach involves modeling the outputs produced at different times in response to the inputs; the inputs as well as the outputs form a sequence indexed by time. An important problem in modeling such a system is to capture "information" about the input-output relationship and represent it in a suitable model. For typical dynamic systems, it is generally assumed that only the current or immediately past input significantly changes the output. However, in many situations we are interested in modeling how inputs at much earlier times affect the current output behavior. Bengio et al. [1] introduced an approach to study the strengths and weaknesses of a model's capability to "capture" dependencies of the output on much earlier inputs. Specifically, the term "information latching" in this context refers to the long-term storage of definite bits of information in the state variables of a dynamic system. The information latching problem is a minimal task designed by Bengio as a test that must be passed for a dynamic system to latch information robustly [1]. The task is to classify two different classes of sequences, where the class of a sequence x1, x2, …, xL depends only on the first L values of the sequence.

D. BSMRNN "Two-Sequence Problem"

Hochreiter and Schmidhuber proposed the LSTM algorithm, which is able to bridge long time lags [18]; LSTM is the state of the art in recurrent networks. In order to compare SMRNN with LSTM, we carried out experiments on the "two-sequence problem," which has been used to evaluate the LSTM algorithm. The "two-sequence problem" is to observe and then classify input sequences into two classes, each occurring with probability 0.5. Given a constant minimal length, the sequence length is randomly selected from a range above it. Only the first real-valued sequence elements convey relevant information about the class; the later sequence elements are generated by a Gaussian with mean zero and standard deviation 0.2. In one case, the first sequence element distinguishes class 1 from class 2; in the other, the first three elements do. The target output at the sequence end is 1.0 for class 1 and 0.0 for class 2, and correct classification is defined as "absolute output error at sequence end below 0.2." We stop training according to the following criterion: none of 256 sequences from a randomly chosen test set is misclassified. As indicated in Table VII of [18], LSTM achieved a quite low fraction of misclassified sequences in both cases. The results achieved by SMRNN are competitive with those of LSTM; moreover, BSMRNN reaches the stopping criterion using fewer training epochs than LSTM.


This work is implemented with Enterprise JavaBeans: the text file is converted to a wave file using the Haar algorithm, and the output is trained and tested to check the long-term dependency handling of the BSMRNN.

Fig. 9. Screen shot for wavelet

Fig. 10. Screen shot for segmenting the inputs


The SMRNN can learn to bridge long time lags even in noisy, highly unpredictable environments, without loss of short-time-lag capabilities. As shown by the experiments on information latching, BSMRNN converges fast and generalizes well for short sequences as well. Speech recognition has great potential to become an important factor in human-computer interaction in the near future. A system has been proposed that combines the advantages of BSMRNN and Haar wavelets for speaker-independent speech processing; the parameters of the BSMRNN and Haar subsystems can influence each other. By inserting intervals into the memory of contextual information, BSMRNNs perform better on long-term dependency problems than existing recurrent networks without segmented memory. SMRNN achieves much higher accuracies than Elman's network on the information latching problem, and its accuracies on the "two-sequence problem" are competitive with those of the conventional model.