What Is Text To Speech Computer Science Essay

Speech is our main medium of communication in day-to-day life. When it comes to interacting with computers, however, most communication today happens through reading the screen: surfing the internet, reading emails, eBooks, research papers and much more, which is very time consuming. The visually impaired community in Sri Lanka has particular difficulty communicating with computers, since no suitable tool is available for convenient use. As a solution to this problem, this project proposes an effective tool for Text-To-Speech conversion that produces speech in the native language.

What is text-to-speech?

Not everybody can read text displayed on a screen or printed on paper. The person may be partially sighted, or may not be literate. Such people can be helped by generating speech instead of printing or displaying text, using a Text-To-Speech (TTS) system to produce the speech for the given text. A TTS system takes written text (from a web page, text editor, clipboard, etc.) as input and converts it to an audible format so that you can hear what the text says. It identifies and reads aloud what is displayed on the screen. With a TTS application, one can listen to computer text instead of reading it. That means you can listen to your emails or eBooks while doing something else, which saves valuable time. Apart from saving time and empowering the visually impaired population, TTS can also be used to overcome the literacy barrier of the common masses, to improve man-machine interaction through on-line newspaper reading from the internet, and to enhance other information systems such as learning guides for students, IVR (Interactive Voice Response) systems, automated weather forecasting systems and so on [1][2].

What is "Sinhala Text To Speech"?

"Sinhala Text To Speech" is the system I selected as my final research project. As a post graduate student I selected a research project that will convert the Sinhala input text into a verbal form.

The term "Text-To-Speech" (TTS) refers to the conversion of input text into a spoken utterance. Here the input is Sinhala text, which may consist of a number of words, sentences, paragraphs, numbers and abbreviations. The TTS engine should identify each of these without ambiguity and generate the corresponding speech sound wave with acceptable quality. The output should be understandable to an average listener without much effort, which means that it should be as close as possible to natural speech quality.
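
To make the identification step more concrete, the following is a minimal sketch of how input text might be split into tokens and tagged as words, numbers, abbreviations or punctuation before synthesis. It is an illustration under stated assumptions, not the engine's actual front end: the function names, the punctuation set and the abbreviation table are hypothetical, and the table holds English placeholder entries where a real Sinhala system would hold Sinhala ones.

```python
import re

# Hypothetical abbreviation table with English placeholders; a real Sinhala
# system would hold Sinhala abbreviations and their spoken expansions.
ABBREVIATIONS = {"Dr.": "Doctor", "etc.": "et cetera"}

# Assumed set of punctuation marks to separate from the words they follow.
PUNCTUATION = ".,!?;:"

NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def tokenize(text):
    """Split on whitespace, then peel trailing punctuation off each chunk."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            # Keep the abbreviation whole so its final period is not
            # mistaken for the end of a sentence.
            tokens.append(chunk)
            continue
        trailing = []
        while chunk and chunk[-1] in PUNCTUATION:
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(trailing))
    return tokens


def classify(token):
    """Return the kind of a single token."""
    if token in ABBREVIATIONS:
        return "abbreviation"
    if NUMBER_PATTERN.fullmatch(token):
        return "number"
    if token in PUNCTUATION:
        return "punctuation"
    return "word"


def tag_tokens(text):
    """Return (token, kind) pairs for the whole input text."""
    return [(token, classify(token)) for token in tokenize(text)]


print(tag_tokens("Dr. Perera has 25 books."))
```

Splitting on whitespace rather than on a word-character pattern is a deliberate choice here: Sinhala vowel signs are combining marks, which a pattern such as `\w+` in Python's `re` module would not treat as part of a word.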

Speech is produced when air is forced from the lungs through the vocal cords (glottis) and along the vocal tract. Speech can be split into a rapidly varying excitation signal and a slowly varying filter. The envelope of the power spectrum contains the vocal tract information [40].
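
This split is commonly written as the source-filter model. As a sketch, with symbols that are my own assumptions rather than notation from the cited source: let e(n) be the excitation signal, h(n) the impulse response of the vocal tract filter, and s(n) the resulting speech signal, with E, H and S their spectra.

```latex
\begin{align}
  s(n) &= e(n) * h(n) \\
  S(\omega) &= E(\omega)\, H(\omega) \\
  \log \lvert S(\omega) \rvert^{2} &= \log \lvert E(\omega) \rvert^{2} + \log \lvert H(\omega) \rvert^{2}
\end{align}
```

In the log power spectrum the two parts add. The slowly varying term from H forms the envelope, while the term from E contributes the fine structure, which is why the envelope carries the vocal tract information.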

The verbal form of the input should be understandable to the receiver, so the output will be made as close as possible to the natural human voice. The system will offer a few main features. After entering the text, the user will be able to select one of several voice qualities: a female voice, a male voice or a child's voice. The user will also be able to vary the speed of the voice.

My project will bring a few main benefits to the users who intend to use it.

The basic architecture of the project is shown below.

Figure 1.2: Basic architecture of the project (inputs: text in Sinhala, and voice and speed selection; then the process; output: Sinhala voice)

1.3 Why do we need "Sinhala Text To Speech"?

Since most commercial computer systems and applications are developed in English, their use and benefits are limited to people who are literate in English. As a result, the majority of the world cannot take advantage of such applications. This also applies to Sri Lanka. Though Sri Lankans have high language literacy, computer and English literacy in suburban areas is somewhat low. Therefore, the benefits and advantages that can be gained from computer and information systems are kept away from people in rural areas. One way to overcome this is localization. "Sinhala Text To Speech" will act as a strong platform to boost software localization and to reduce the gap between computers and people.

AIMS AND OBJECTIVES

The main objective of the project is to develop a fully featured, complete Sinhala Text To Speech system that gives speech output similar to the human voice while preserving the native prosodic characteristics of the Sinhala language. The system will have a female voice, which is in high demand in the current localization software industry. It will act as the main platform for Sinhala Text To Speech, and developers will be able to build end-user applications on top of it. This will benefit the visually impaired population and people with low IT literacy in Sri Lanka by enabling convenient access to information such as emails, eBooks, website content, documents and learning tutors. An end-user Windows application will be developed, and it will act as a document reader as well as a screen reader.

The aim is to develop a system that can read text in Sinhala and convert it into verbal (Sinhala) form. It will also be able to change the sound waves; that is, the user will be able to select a voice quality according to his or her preference. There may be three voice selections: a female voice, a male voice and a child's voice. The user can also change the speed of the voice: anyone who needs to hear slower or faster speech can change it to suit their requirements.

SPECIFIC STUDY OBJECTIVES

Produce a verbal format for the input Sinhala text.

The input Sinhala text, which may be typed by the user or taken from a given text document, will be transformed into sound waves, which are then output through speakers. Disabled people will therefore be among the stakeholders who benefit most from the Sinhala Text To Speech system. Undergraduates and researchers who need to consult many references can also send the text to the system, listen, and pick out what they need.

The output would be more like natural speech.

The human voice is a complex acoustic signal, generated by an air stream expelled through the mouth, the nose or both. Important characteristics of the speech sound are speed, silence, accentuation and the level of energy output. The tongue and lips, with the help of other articulators in the vocal system, control the air stream appropriately. The speaker's vocal system produces many variations in the speech signal in order to convey meaning and emotion to the receiver, who then understands the message. Many other characteristics are involved as well, which the receiver's hearing system uses to identify what is being said.

Identify an efficient way of translating Sinhala text into verbal form.

By developing this system we will be able to identify and propose the most suitable algorithm for translating Sinhala text into verbal form in a fast and efficient manner.

Control the speed and type of the voice (e.g. a man's, woman's or child's voice).

Users will be able to select the quality of the sound wave they want. They will also be allowed to reset the speed of the output as needed. People who would like to learn Sinhala as a second language can learn elocution properly by reducing or increasing the speed, which will improve their listening capabilities.

Small children can be encouraged to learn the language by varying the speed and voice types.

Propose ways in which the current system can be extended further to meet future needs.

This system provides only the basic functions. It can be enhanced further to satisfy the changing requirements of users. It can be embedded into toys to improve children's listening and elocution abilities, and so broaden their speaking capacity.

RELEVANCE OF THE PROJECT

The thought of developing a Sinhala Text To Speech (STTS) engine began when I considered the opportunities available for Sinhala-speaking users to benefit from Information and Communication Technology (ICT). In Sri Lanka more than 75% of the population speaks Sinhala, but it is very rare to find Sinhala software or Sinhala materials on ICT in the market. This directly affects the development of ICT in Sri Lanka.

At present a few Sinhala text-to-speech software packages are available, but they have problems with sound quality, font schemes, pronunciation and so on. Because of these problems, developers are reluctant to use those STTS engines in their applications. My focus is on developing an engine that can convert Sinhala words in digitized form into Sinhala pronunciation without errors. This engine will help in developing a range of applications.

Some applications where STTS can be used

Document reader. An already digitized document (e.g. e-mails, e-books, newspapers) or a conventional document that has been scanned and passed through an optical character recognizer (OCR) can be read aloud.

Aid to handicapped persons. The vision- or voice-impaired community can use computer-aided devices to communicate directly with the world. A vision-impaired person can be informed by an STTS system. A voice-impaired person can communicate with others by means of a keypad and an STTS system.

Talking books and toys. Producing talking books and toys will boost the toy market and education.

Help assistant. A help assistant that speaks Sinhala can be developed, like the MS Office help assistant.

Automated newscasting. An entirely new breed of television networks, with programs hosted by computer-generated characters, becomes possible.

Sinhala SMS reader. SMS messages contain many abbreviations. A system that reads those messages aloud will help the receivers.

Language education. A high-quality TTS system incorporated into a computer-aided device can be used as a tool for learning a new language. Such tools can help learners improve very quickly, since they have access to the correct pronunciation whenever needed.

Traveler's guide. A system located inside a vehicle or mobile device can give information on the current location and other relevant information, incorporated with GPRS.

Alert systems. A TTS system can be incorporated into alert systems to attract the attention of the controlled elements, since humans are used to having their attention drawn by voice.

In particular, countries like Sri Lanka, which are still struggling to harvest the benefits of ICT, can use a Sinhala TTS engine to convey information effectively. Users who can get the information they need in their native language (i.e. by converting the text to native-language text) will naturally turn their thoughts to the achievable benefits and will be encouraged to use information technology much more frequently.

Therefore, the development of a TTS engine for Sinhala will bring personal benefits to users from a social perspective (e.g. aid for the handicapped, language learning) and financial benefits in economic terms (e.g. virtual television networks, toy manufacture).

RESEARCH METHODOLOGY AND TECHNOLOGIES

As the initial step, we studied the Sinhala TTS implemented by the Language Technology Research Laboratory (LTRL) of the University of Colombo School of Computing (UCSC). This gave us the opportunity to understand how the system works. The LTRL agreed to release the source code of the implemented system to us on request.

The mismatches between the currently implemented TTS system and the native Sinhala language with respect to prosody were studied. By analyzing the rule base and experiencing the system as an end user, we identified the improvements required to bring the speech output of the system closer to the native language. This included pattern matching of words and checking the prosody of a word with respect to its position in the sentence, such as a word before a question mark, exclamation mark, comma or full stop, and whether it is at the beginning or the end of the sentence. We also planned to introduce a female voice.
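
The position check described above can be sketched roughly as follows. This is an illustrative assumption about how such context might be gathered, not the rule base of the LTRL system: the function name and the labels ("initial", "medial", "final", "statement", "question", "exclamation") are hypothetical, and a real system would go on to map each context to a pitch and duration pattern.

```python
# Hypothetical mapping from the closing punctuation mark to a sentence type.
SENTENCE_TYPES = {".": "statement", "?": "question", "!": "exclamation"}


def prosodic_contexts(sentence):
    """Describe each word by its position and the punctuation around it."""
    sentence = sentence.strip()
    # Default to "statement" when the sentence has no closing mark.
    sentence_type = SENTENCE_TYPES.get(sentence[-1:], "statement")
    words = sentence.rstrip(".?!").split()
    contexts = []
    for index, word in enumerate(words):
        if index == 0:
            # A one-word sentence is labeled "initial" in this sketch.
            position = "initial"
        elif index == len(words) - 1:
            position = "final"
        else:
            position = "medial"
        contexts.append({
            "word": word.rstrip(","),
            "position": position,
            "before_comma": word.endswith(","),
            "sentence_type": sentence_type,
        })
    return contexts


for context in prosodic_contexts("Well, is it raining?"):
    print(context)
```

Keeping the context as plain labels separates the question of where a word sits from the question of how it should sound, so the prosody rules can be revised without touching the text analysis.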

The identified mismatches were then corrected and the improvements applied to bring the system to a better version.

We aimed to develop the solution through short-term goals, which give a sense of accomplishment and make the work easier. Project review was a very useful and powerful way of adding a continuous improvement mechanism. The project supervisors were consulted regularly for reviews and feedback in order to make the right decisions, clear up misunderstandings and carry out future development effectively and efficiently. Good planning and meeting follow-up were crucial to making these reviews a success.

Database Technology

Object-oriented methodologies and a relational database management system (Microsoft® SQL Server™ 2008) were selected to develop a centralized database on the main server. A database management system, or DBMS, is software designed to assist in maintaining and utilizing large collections of data [42]. SQL Server 2008 is designed to work as a data storage engine for thousands of concurrent users who connect over a network; it is also capable of working as a standalone database directly on the same computer as an application [41]. A DBMS provides several important functions. Applications are independent of data representation, storage and location (data and location independence). A DBMS can scan through millions of records and retrieve data efficiently (efficient data access). A DBMS enforces integrity constraints and security permissions on the data (data integrity and security). A DBMS provides facilities for administering data and keeping it efficiently accessible (data administration). A DBMS schedules concurrent access to the data in such a manner that each user can think of the data as being accessed by one user at a time, and it protects users from the effects of system failures (concurrent access and crash recovery). We therefore hoped to use Microsoft® SQL Server™ 2008 to develop the voice and text information database.

The speech database is the major component of the system. To build it, a list of diphones is made first, and then sentences that contain those diphones are selected. Those sentences are then recorded and properly labeled for use in the speech database.
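
The sentence selection step can be sketched as a greedy coverage search: repeatedly pick the candidate sentence that contains the most diphones not yet covered. This is an assumed approach for illustration and not necessarily the procedure used for this voice. The function names, sentence identifiers and phone labels are hypothetical, and silence at sentence boundaries is ignored for brevity.

```python
def diphones(phones):
    """Return the set of adjacent phone pairs in a phone sequence."""
    return {(first, second) for first, second in zip(phones, phones[1:])}


def select_sentences(candidates, required):
    """Greedily choose sentences until the required diphones are covered.

    candidates maps a sentence identifier to its phone sequence.
    Returns the chosen identifiers and any diphones left uncovered.
    """
    remaining = set(required)
    chosen = []
    while remaining and candidates:
        # Pick the sentence that covers the most still-missing diphones.
        best = max(
            candidates,
            key=lambda name: len(diphones(candidates[name]) & remaining),
        )
        gained = diphones(candidates[best]) & remaining
        if not gained:
            break  # no candidate adds anything new
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


# Hypothetical phone sequences standing in for transcribed sentences.
candidates = {
    "s1": ["k", "a", "t", "a"],
    "s2": ["t", "a", "k"],
    "s3": ["a", "m", "a"],
}
required = set().union(*(diphones(p) for p in candidates.values()))

chosen, missing = select_sentences(candidates, required)
print(chosen, missing)
```

Any diphones returned as uncovered indicate that further sentences need to be written or found before recording starts.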

The diphones of each voice are stored in the speech database. When adding a new voice, we introduced the relevant files, including the diphones, to the database. Even though we call it a database, it is a set of files residing in the system.

Security

Integrity is a major concept in the area of security. With respect to integrity, the system should work as the user expects. In this case the user expects the system to read the input strings and words correctly. Correctness of the output is therefore very important to this system in the context of security.

Performance

A major requirement is fast access to the stored information, so the synthesis system must be both fast and efficient. Since this system provides speech output, the audible speech should not have any delays in the middle, and the overall performance should be satisfactory.

Systems of this type should maintain a uniform flow of output, so the output stream needs to keep consistent characteristics throughout. A reasonable delay before reading starts is acceptable.

BACKGROUND AND LITERATURE REVIEW

"Text to speech" is a very popular area of computer science, and a good deal of research has been carried out in it. Most of this research is based on how to produce more natural speech for a given text. Text-to-speech packages are freely available, but most of this software is developed for the most common languages, such as English, Japanese and Chinese. Some software companies also distribute text-to-speech development tools for English. The Microsoft Speech SDK is one example of a freely distributed toolkit developed by Microsoft for the English language.

Nowadays, several universities and research labs are carrying out research projects on text to speech. Carnegie Mellon University focuses part of its research on text to speech (TTS). It provides open source speech software, toolkits, related publications and important techniques to undergraduate students and software developers. The TCTS Lab is also doing research in this area. It introduced a simple but general functional diagram of a TTS system [39].

Image Credit: Thierry Dutoit.

Figure: A simple, but general functional diagram

Before the project was initiated, basic research was done to become familiar with TTS systems and to gather information about existing systems of this kind. Later, a comprehensive literature survey was done in the fields of the Sinhala language and its characteristics, Festival and Festvox, generic TTS architecture, building new synthetic voices, Festival and Windows integration, and how to improve existing voices.

History of Speech Synthesizing

A historical analysis is useful for understanding how the current systems work and how they developed into their present form. This section discusses the history of synthesized speech, from mechanical synthesis to today's high-quality synthesizers, along with some milestones in synthesis-related techniques.

Efforts to produce synthetic speech began more than two hundred years ago. In 1779, Russian Professor Christian Kratzenstein explained the differences between five vowels (/a/, /e/, /i/, /o/, and /u/) and constructed equipment to create them. He built acoustic resonators similar to the human vocal tract and activated them with vibrating reeds [16].

In 1791, Wolfgang von Kempelen introduced his "Acoustic-Mechanical Speech Machine", which generated single sounds and combinations of sounds. He described his studies on speech production and his experiments with the speech machine in his publications. The crucial components of the machine were a pressure chamber for the lungs, a vibrating reed to act as the vocal cords, and a leather tube for the vocal tract; he was able to produce different vowel sounds by controlling the shape of the leather tube. Consonants were created by four separate constricted passages controlled by the fingers, and a model of the vocal tract with a hinged tongue and movable lips was used for plosive sounds.

In the mid 1800s, Charles Wheatstone built a version of Kempelen's speaking machine that was capable of generating vowels and other sounds, combinations of some sounds and even full words. Vowels were generated using the vibrating reed with all passages closed, and consonants, including nasals, were generated by turbulent flow through an appropriate passage with the reed off.

In the 1800s, Alexander Graham Bell and his father constructed the same kind of machine without any significant success. Bell also put his dog between his legs, made it growl, and changed its vocal tract by hand to produce sounds.

No significant improvements came from research and experiments with mechanical and semi-electrical analogs of the vocal system until the 1960s [38].

The first fully electrical synthesis device was introduced by Stewart in 1922 [17]. It had a buzzer for excitation and two resonant circuits to model the acoustic resonances of the vocal tract. The machine could produce single static vowel sounds with the two lowest formants, but it could not produce consonants or connected utterances. A similar synthesizer was made by Wagner [27]. This device consisted of four electrical resonators connected in parallel, and it was also excited by a buzz-like source. The outputs of the four resonators were combined in the proper amplitudes to produce vowel spectra. In 1932, two researchers, Obata and Teshima, discovered the third formant in vowels [28]. The first three formants are generally considered to be enough for intelligible synthetic speech.
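
The idea of a buzz source driving resonators can be illustrated with a small digital analogue. The sketch below is not a model of Stewart's circuit: it passes an impulse train through two second-order digital resonators in cascade (an assumption), and the formant frequencies, bandwidths, pitch and sample rate are rough illustrative values chosen by me, not figures from the cited sources.

```python
import math


def resonator(signal, frequency, bandwidth, sample_rate):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    period = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth * period)
    b = (2.0 * math.exp(-math.pi * bandwidth * period)
         * math.cos(2.0 * math.pi * frequency * period))
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    previous, before_previous = 0.0, 0.0
    output = []
    for sample in signal:
        value = a * sample + b * previous + c * before_previous
        output.append(value)
        before_previous, previous = previous, value
    return output


def buzz(pitch_hz, duration_s, sample_rate):
    """Impulse train standing in for the buzzer excitation."""
    count = int(duration_s * sample_rate)
    spacing = int(sample_rate / pitch_hz)
    return [1.0 if index % spacing == 0 else 0.0 for index in range(count)]


def static_vowel(first_formant, second_formant,
                 pitch_hz=100.0, duration_s=0.5, sample_rate=16000):
    """Shape the buzz with two resonators to get a static vowel-like sound."""
    source = buzz(pitch_hz, duration_s, sample_rate)
    shaped = resonator(source, first_formant, 60.0, sample_rate)
    return resonator(shaped, second_formant, 90.0, sample_rate)


# Assumed, approximate formant values for an /a/-like vowel.
samples = static_vowel(700.0, 1200.0)
print(len(samples), max(abs(value) for value in samples))
```

Changing only the two formant frequencies changes which vowel the sound resembles, which is the point the early electrical synthesizers demonstrated. Writing the samples to a WAV file would be needed to actually hear the result.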

The first device that could be considered a speech synthesizer was the VODER (Voice Operating DEmonstratoR), introduced by Homer Dudley at the New York World's Fair in 1939 [17][27][29]. The VODER was inspired by the VOCODER (Voice CODER), which was developed at Bell Laboratories in the mid-thirties, mainly for communication purposes. The VOCODER was built as a voice transmitting device, an alternative for low-band telephones: it analyzed wideband speech, converted it into slowly varying control signals, sent those over a low-band phone line, and finally transformed those signals back into the original speech [36]. The VODER consisted of touch-sensitive switches to control the voice and a pedal to control the fundamental frequency.

After the VODER demonstrated that a machine could produce the human voice intelligibly, people became more interested in speech synthesis. In 1951, Franklin Cooper and his associates developed a pattern playback synthesizer at the Haskins Laboratories [17][29]. Its method was to reconvert recorded spectrogram patterns into sounds, in either original or modified form. The spectrogram patterns were stored optically on transparent belts.

The formant synthesizer was introduced by Walter Lawrence in 1953 [17] and was named PAT (Parametric Artificial Talker). It consisted of three electronic formant resonators connected in parallel. Either a buzz or a noise was used as the input signal. It could control the three formant frequencies, voicing amplitude, fundamental frequency and noise amplitude. At approximately the same time, Gunnar Fant introduced the first cascade formant synthesizer, named OVE I (Orator Verbis Electris). In 1962, Fant and Martony introduced an improved synthesizer named OVE II, which contained separate parts to model the transfer function of the vocal tract for vowels, nasals and obstruent consonants. The OVE projects were improved further, and as a result OVE III and GLOVE were introduced at the Kungliga Tekniska Högskolan (KTH), Sweden; the present commercial Infovox system is descended from these [30][31][32].

PAT and OVE gave rise to a discussion on how the transfer function of the acoustic tube should be modeled: in parallel or in cascade. John Holmes introduced his parallel formant synthesizer in 1972 after studying these synthesizers for a few years. The voice synthesis was so good that the average listener could not tell the difference between the synthesized and the natural speech [17]. About a year later he introduced a parallel formant synthesizer developed with the JSRU (Joint Speech Research Unit) [33].

The first articulatory synthesizer was introduced in 1958 by George Rosen at the Massachusetts Institute of Technology (M.I.T.) [17]. The DAVO (Dynamic Analog of the VOcal tract) was controlled by tape recordings of control signals created by hand. The first experiments with Linear Predictive Coding (LPC) were made in the mid 1960s [28].

The first full text-to-speech system for English was developed at the Electrotechnical Laboratory, Japan, in 1968 by Noriko Umeda and colleagues [17]. The synthesis was based on an articulatory model and included a syntactic analysis module with some sophisticated heuristics. Though the speech was intelligible, it was still monotonous.