Based Speech Recognition Features For Gamelan Instruments English Language Essay

In this paper, features that are usually used in automatic speech recognition (ASR) are applied to gamelan instrument identification. In particular, spectral-envelope-based features, composed of spectral powers and their spectral derivatives, are compared with an established feature set previously developed for signal and music analysis. This research compares the ASR features with the spectral-power-based features. The performance and contribution of the ASR features are analyzed using filter-based feature ranking and a support vector machine (SVM). The rank of each feature was determined using the ReliefF and Gain Ratio algorithms. The experiments showed that the ASR features contribute significantly to the recognition rate. The first ten features selected by the two ranking techniques are the same. The top-ranked features are: fundamental frequency, spectral flux, spectral centroid, spectral roll off 90%, cepstral coefficient 10th, filter bank energy 10th, filter bank energy 8th, cepstral coefficient 8th, spectral kurtosis, spectral skewness, and spectral roll off 40%.

Key words

support vector machine, automatic transcription, ReliefF, Gain Ratio, feature selection

1. Introduction

Automated detection of gamelan instruments has received little attention in recent years; only a few studies have analyzed gamelan instruments and their performances. Speech analysis has dominated the field of audio research and received more attention than music analysis, so researchers in music analysis often look to the features applied in speech analysis. Spectral-envelope-based speech recognition features have been used extensively in speech analysis [1] [2] [3]. This research applies automatic speech recognition (ASR) features to gamelan instrument identification. Selecting the best features is an important part of developing a good identification or classification system, especially when using the machine learning or pattern recognition paradigm. There have been several studies on the importance and ranking of various features and feature sets for the task of musical instrument identification [4] [5]. In this research, three conventional ASR feature sets are compared with the feature set previously developed for gamelan instrument identification [6]. That feature set is composed of spectral centroid, spectral rolloff 40%, spectral rolloff 90%, spectral flux, spectral skewness, spectral moment, and spectral kurtosis. Although some features may be redundant, preliminary experiments confirmed that the SVM is not very sensitive to their presence. Initial tests showed that the first features give more than 95% accuracy, and that the accuracy becomes low when using the first two features.

The process of gamelan transcription aims to convert a gamelan recording or performance into gamelan notation or a gamelan score. Gamelan notation is any system that represents the pitch and duration of a gamelan sound through written symbols. This research is part of a project that aims to develop a system that extracts note events from gamelan sound files, i.e., real recordings including an undetermined number of notes played at a time by four gamelan instruments (demung, saron, peking, and bonang).

The rest of this paper is organized as follows. Section 2 describes the research method used to obtain the optimal spectral-based feature subset. Section 3 presents our experiments and discusses the results. Finally, Section 4 gives the conclusions of our experiments.

2. Research Method

A general view of the system architecture is presented in Fig. 1. In the diagram, the focus of this research is marked with a dashed line and shown in detail in Fig. 2. The transcriber accepts a WAV audio file (mono or stereo) as input. The Pre-Processing module cleans the gamelan recording. The processed signal then enters the STFT and Onset Detector modules, which determine the onsets. The onset times are used to perform segmentation and feature extraction. Based on the extracted features, the Pitch Identification module estimates candidate pitch values and the corresponding instruments.

Figure 1 Automatic Gamelan Notes Transcription System Architecture

We used the Short Time Fourier Transform (STFT) to calculate the spectral features of audio clips based on the spectral envelope. The spectral envelope denotes a curve in the frequency-amplitude plane derived from a Fourier magnitude spectrum [7]. Because the Fourier transform measures all frequency components, we used the dominant frequency of the spectrum as the center of the clipped spectral envelope. The envelopes were clipped to a range (in Hz) around the position of the center of the peak. Based on the spectral envelope, we then computed the values of all of the features. These features were then combined into a feature vector for a gamelan sound. Finally, the feature vector was normalized by dividing each feature component by a constant so that the result is a real number between -1 and 1. The normalized feature vector was considered the final representation of the gamelan sound (see Fig. 2).
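The normalization step can be illustrated with a minimal Python sketch. The text does not say which constant each feature is divided by, so the choice below (the largest absolute value of that feature over the data set) is an assumption, as is the function name `normalize_features`.

```python
def normalize_features(rows):
    """Scale every feature column into [-1, 1] by dividing by a per-feature constant.

    Assumption: the constant is the largest absolute value observed for that
    feature (1.0 is used for an all-zero column to avoid division by zero).
    """
    n_features = len(rows[0])
    scales = []
    for j in range(n_features):
        peak = max(abs(row[j]) for row in rows)
        scales.append(peak if peak > 0 else 1.0)
    normalized = [[row[j] / scales[j] for j in range(n_features)] for row in rows]
    return normalized, scales


# Two hypothetical feature vectors with three features each.
vectors = [[440.0, -2.0, 0.0], [880.0, 1.0, 0.0]]
normalized, scales = normalize_features(vectors)
```

The same constants would have to be reused for the testing data so that training and testing vectors share one scale.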

Several spectral-envelope-based speech recognition features were chosen and compared. Features such as Mel-scale cepstral coefficients (MFCC), cepstral coefficients (CC) [8], and frequency-filtered band energies (FF) [9] are used almost exclusively in speech analysis. These features have been proven capable of capturing the information needed for speech recognition tasks. The features used in ASR are logarithmically scaled ratios of sub-band energies that are either localized in frequency (FF) or full-band (CC). ASR features such as FF and CC can be interpreted in terms of spectral derivatives, i.e., spectral slopes and spectral centroid [10]. FF and CC are computed from filter bank energies (FBEs). The FBEs are a discrete representation of the smoothed spectral energy distribution. FBEs can be converted to cepstral coefficients (CC) through the discrete cosine transform (DCT). The DCT maps a signal from the time domain to the frequency domain. It is a Fourier-related transform similar to the discrete Fourier transform (DFT), but using only real numbers.

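The conversion from filter bank energies to cepstral coefficients can be sketched as follows. This is a minimal illustration that assumes an unnormalized DCT-II applied to the natural logarithm of the energies; the scaling convention, the function name `fbe_to_cepstrum`, and the example values are assumptions rather than details taken from the text.

```python
import math


def fbe_to_cepstrum(fbes, n_coeffs):
    """Cepstral coefficients as the DCT-II of the log filter bank energies.

    Assumption: unnormalized DCT-II; all energies must be strictly positive.
    """
    q = len(fbes)
    log_e = [math.log(e) for e in fbes]
    return [
        sum(log_e[k] * math.cos(math.pi * n * (k + 0.5) / q) for k in range(q))
        for n in range(n_coeffs)
    ]


# Hypothetical energies of a four-band filter bank.
coeffs = fbe_to_cepstrum([2.0, 4.0, 3.0, 1.5], n_coeffs=3)
```

With this convention, a perfectly flat energy distribution puts everything into coefficient 0 and leaves the higher coefficients at zero, which is why the higher coefficients describe the shape of the envelope rather than its level.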
Figure 2 Block diagram for extraction of the features

2.1 Time-Frequency Analysis and Segmentation

A spectrogram is a spectro-temporal representation of a sound; it provides a time-frequency portrait of gamelan sounds. The STFT is the most commonly used method for generating time-frequency representations, or spectrograms, of musical signals. The result of the STFT can be plotted as a 2D or 3D spectrogram as a function of both time and frequency: magnitude is represented by color or intensity in a 2D spectrogram and by the height of the surface in a 3D spectrogram. However, the STFT suffers from the shortcoming that the length of the window determines both the time and the frequency resolution of the spectrogram [11] [12]. The shorter the window, the higher the time resolution. For a long window, the frequency resolution is high, but the time resolution is low. For pitch analysis such as automatic gamelan note transcription, the frequency resolution of the spectrogram is more important than the time resolution [12]. An STFT with a long window is therefore a powerful tool for automatic gamelan note transcription. For the STFT, we used a Hamming window of 4096 samples with 60% overlap and an FFT length of 4096.

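The trade-off between time and frequency resolution can be made concrete with a short calculation. The sketch below uses the window length, FFT length, and overlap quoted above together with the 44100 Hz sampling rate reported for the recordings in Section 2.3; the function name `stft_resolution` and the rounding of the hop size are assumptions.

```python
def stft_resolution(sample_rate, window_length, fft_length, overlap):
    """Return the basic time and frequency resolution figures of an STFT."""
    hop = round(window_length * (1.0 - overlap))  # samples between frame starts
    return {
        "bin_spacing_hz": sample_rate / fft_length,
        "window_duration_s": window_length / sample_rate,
        "hop_samples": hop,
        "hop_duration_s": hop / sample_rate,
    }


settings = stft_resolution(sample_rate=44100, window_length=4096,
                           fft_length=4096, overlap=0.6)
```

Under these settings the frequency bins are about 10.8 Hz apart, each window spans about 93 ms, and successive frames start 1638 samples (about 37 ms) apart.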
Figure 3 Onset-based segmentation for 'Manyar Sewu'

Segmentation is an essential process in automatic gamelan note transcription, with a significant impact on pitch and instrument recognition performance. In this paper we used a simple segmentation method based on onset information, which makes it possible to distinguish different note onsets in a gamelan recording. An example of a spectrogram segmented using the onset information can be seen in Fig. 3, and the details of the segmented spectrogram in Fig. 4.

Figure 4 Onset-based segmentation in detail
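Onset-based segmentation can be sketched in a few lines of Python. The sketch assumes that the onset detector returns onset times in seconds and that each segment runs from one onset to the next (the last one to the end of the recording); the function name `segment_by_onsets` is hypothetical, and the text does not specify how segment ends are chosen.

```python
def segment_by_onsets(samples, onset_times, sample_rate):
    """Split a signal into one segment per onset.

    Assumption: a segment ends where the next onset begins, and the final
    segment runs to the end of the signal.
    """
    bounds = sorted(
        {min(len(samples), max(0, round(t * sample_rate))) for t in onset_times}
    )
    bounds.append(len(samples))
    return [samples[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


# Toy signal of 10 samples at 10 Hz with onsets at 0.2 s and 0.5 s.
segments = segment_by_onsets(list(range(10)), [0.2, 0.5], sample_rate=10)
```

Anything before the first onset is discarded in this sketch, on the assumption that it contains no note.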

2.2 Feature Extraction and Ranking

Once onset detection has been done, the audio signal is fragmented into smaller segments, and a feature set that describes these segments in a meaningful way is extracted. Based on the onset information, the ASR features and the spectral-based features, such as the spectral centroid, spectral flux, and the mean and variance of the segment, are extracted and calculated.

In this paper, we use 68 features (as shown in Table 1), including: fundamental frequency, spectral centroid (Sc), two spectral rolloffs (Fc), spectral flux (Sf), spectral skewness (Sa), spectral kurtosis (Sk), spectral slope (Ss), and spectral bandwidth (Sw). Besides the spectral-based features, we also extracted automatic speech recognition (ASR) features such as cepstral coefficients (CC) and filter bank energies (FBE).

Table 1 ASR and spectral-based features

No      Features                      Number of features
1       Fundamental frequency         1
2       Spectral centroid (Sc)        1
3-4     Spectral rolloff (Fc)         2
5       Spectral flux (Sf)            1
6       Spectral skewness (Sa)        1
7       Spectral moment (Sm)          1
8       Spectral kurtosis (Sk)        1
9       Spectral entropy (Se)         1
10      Spectral slope (Ss)           1
11      Spectral bandwidth (Sw)       1
12      Mean                          1
13      Standard deviation            1
14      Mode                          1
15      Median                        1
16      Variance                      1
17-25   Percentile                    9
26-34   Quantile                      9
35-51   Cepstral coefficients (CC)    17
52-68   Filter Bank Energy (FBE)      17
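A few of the spectral descriptors listed in Table 1 can be sketched in plain Python. Definitions of these features vary between authors, so the forms below (magnitude-weighted centroid, rolloff on cumulative magnitude, flux as a sum of squared differences between successive frames) are common textbook choices and are assumptions here, not necessarily the exact formulas used in the experiments.

```python
def spectral_centroid(freqs, mags):
    """Magnitude-weighted mean frequency of one spectrum frame."""
    total = sum(mags)
    return sum(f * m for f, m in zip(freqs, mags)) / total if total > 0 else 0.0


def spectral_rolloff(freqs, mags, fraction):
    """Lowest frequency below which `fraction` of the total magnitude lies."""
    threshold = fraction * sum(mags)
    running = 0.0
    for f, m in zip(freqs, mags):
        running += m
        if running >= threshold:
            return f
    return freqs[-1]


def spectral_flux(prev_mags, mags):
    """Sum of squared bin-wise changes between two successive frames."""
    return sum((b - a) ** 2 for a, b in zip(prev_mags, mags))


# Hypothetical four-bin spectrum.
freqs = [100.0, 200.0, 300.0, 400.0]
mags = [1.0, 1.0, 1.0, 1.0]
centroid = spectral_centroid(freqs, mags)
rolloff_40 = spectral_rolloff(freqs, mags, 0.40)
rolloff_90 = spectral_rolloff(freqs, mags, 0.90)
```

The two rolloff features of Table 1 correspond to calling `spectral_rolloff` with fractions 0.40 and 0.90.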

Cepstral coefficients (CC) are probably the most widely used spectral representation in speech. They are calculated by applying the discrete cosine transform (DCT) to the log-magnitude Fourier spectrum or to the FBEs. MFCCs are a way of representing the spectral information in a sound signal. Each coefficient has a value for each frame of the sound, and the changes within each coefficient across the duration of the sound are examined. The Mel scale is a perceptual scale based on human hearing. Seventeen MFCC coefficients are extracted from the gamelan sound signals. The computation of the MFCC parameters is performed in five steps [13].

The signal is mapped to the Mel scale filter bank consisting of triangular filters. In the last step an N-point inverse discrete cosine transformation is applied to the signal (see Equation 1).

(1)

In current filter-bank analysis, the extreme bands, which would be centered around $\omega = 0$ and $\omega = \pi$, are not considered in the computed energies $S(k)$, $k = 1, \dots, Q$. The sequence should therefore be extended by appending one zero at each end [12], i.e.

$$S(0) = S(Q+1) = 0 \qquad (2)$$

Frequency filtering (FF) [11] is a transformation of that set of spectral band energies consisting of a convolution between the sequence $S(k)$, $k = 0, \dots, Q+1$, from (2) and a given impulse response sequence $h(k)$ to obtain a new sequence of filtered parameters $F(k)$, i.e.

$$F(k) = h(k) * S(k), \qquad k = 1, \dots, Q \qquad (3)$$

The filtered parameters lie in the frequency domain, and only $Q$ values are computed. It is assumed that $h(k)$ is either a first-order FIR filter or a second-order FIR filter centered around $k = 0$; in this way, only the values $S(0), \dots, S(Q+1)$ from (2) are needed to compute $F(k)$ in Equation (3).
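The frequency-filtering step can be sketched in Python as a short convolution along the band index. The sketch assumes a three-tap filter given as (h(-1), h(0), h(+1)); the default taps (1, 0, -1), which take the difference between the two neighbouring bands, are only one plausible second-order choice and are an assumption, as is the function name `frequency_filter`.

```python
def frequency_filter(band_energies, taps=(1.0, 0.0, -1.0)):
    """Filter a sequence of band energies along the frequency (band) axis.

    The sequence is extended with one zero at each end, and one output is
    produced per original band. `taps` holds (h(-1), h(0), h(+1)); the
    default values are an illustrative assumption.
    """
    padded = [0.0] + list(band_energies) + [0.0]
    q = len(band_energies)
    return [
        sum(h * padded[k - j] for j, h in zip((-1, 0, 1), taps))
        for k in range(1, q + 1)
    ]


# Hypothetical (log) energies of three bands.
filtered = frequency_filter([1.0, 2.0, 4.0])
```

A first-order filter fits the same interface, for example `taps=(0.0, 1.0, -1.0)`, which yields the difference between each band and the band below it.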

All of these features were then combined into a feature vector for a gamelan sound. Finally, the feature vector was normalized by dividing each feature component by a constant so that the result is a real number between -1 and 1. The normalized feature vector is considered the final representation of the gamelan sound.

Once all the features had been determined and extracted, we performed feature comparison and ranking. Two filter-based feature ranking techniques were used for the comparison: Gain Ratio (GR) and ReliefF (RF), both available in the Weka data mining tool [17]. The rank of a feature may represent its degree of relevance, preference, or importance. Feature ranking and selection can help enhance accuracy in instrument identification and also reduce the dimension of the feature space.
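To show what a filter-based ranking criterion computes, here is a minimal Python sketch of the gain ratio of one feature with respect to the class labels. It assumes the feature has already been discretized into a small number of values (continuous features would need to be binned first); it is a textbook formulation, not a reproduction of the Weka implementation, and the example labels are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a discrete feature divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["saron", "saron", "bonang", "bonang"]
informative = gain_ratio(["low", "low", "high", "high"], labels)   # separates the classes
uninformative = gain_ratio(["a", "b", "a", "b"], labels)           # tells nothing
```

Ranking then amounts to sorting the features by this score in descending order.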

2.3 Datasets

Javanese gamelan is an ensemble of percussion instruments, mostly metallophones [16], xylophones, and gong-type instruments, which produce tones when struck with horn or wooden mallets. A complete gamelan set consists of 72 instruments [17], for example kendang, the saron group, the bonang group, kethuk-kenong, and gongs. The saron group consists of saron demung, saron barung, and saron panerus (peking). These instruments are used to play the core melody, or balungan gendhing.

This study was limited to four instruments: the demung, saron, peking, and bonang. Samples were recorded from the Elektro Budoyo ITS gamelan set. All gamelan sounds are 16-bit, mono-channel, with a sampling frequency of 44100 Hz. We produced the sound data by randomly striking the metal keys or bars with their own hammers at the center, upper, and lower areas. In total this gave about 2645 samples across the entire pitch range of the four instruments.

We randomly partitioned the data into training and testing data sets (see Table 2) to verify performance and evaluate robustness. The testing data consist of samples of the four instruments, each sampled and recorded across its entire range. The testing set was kept completely separate from the training set, as this should test the generality of the classifier. Table 2 lists the sizes of the training and testing sets. In most partitions, the larger part of the data is used for training and a smaller portion for testing.

Table 2 Training and testing data set sizes used for SVM-based gamelan instrument identification

Data set   # Training data   # Testing data
1          2525              120
2          2393              252
3          2263              382
4          2127              518
5          1993              652
6          1866              779
7          1733              912
8          1599              1046
9          1472              1173
10         1325              1320
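A random partition of the kind listed in Table 2 can be sketched as follows. The function name `split_dataset` and the fixed seed are assumptions made for reproducibility of the example; the text does not describe how the random partition was drawn.

```python
import random


def split_dataset(samples, n_test, seed=0):
    """Shuffle the samples and return disjoint (training, testing) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in sample identifiers for the 2645 recordings; sizes follow data set 1.
all_samples = list(range(2645))
training, testing = split_dataset(all_samples, n_test=120)
```

Each row of Table 2 corresponds to a different `n_test`; every row sums to the same 2645 samples.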

We used a Support Vector Machine (SVM) classifier to compare the performance of the feature ranking techniques. SVMs are based on the concept of decision planes that define decision boundaries. A decision plane is one that separates sets of objects having different class memberships. In this research we used LibSVM [18], an efficient SVM library. LibSVM includes four kernel functions: linear, polynomial, radial basis function (RBF), and sigmoid.
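The four kernel functions can be written out directly. The sketch below follows the usual parameterization with `gamma`, `coef0`, and `degree`; the default values used here are placeholders chosen for the example, and the text does not state which kernel or parameters were used in the experiments.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


# Two hypothetical normalized feature vectors.
u, v = [0.5, -0.25], [0.75, 0.5]
similarity = rbf_kernel(u, v, gamma=0.5)
```

Because the feature vectors are normalized to the range -1 to 1, no single feature dominates the inner products and distances these kernels rely on.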

Table 3 Top-ranked ASR and spectral-based gamelan features for each ranking method (entries are feature numbers from Table 1); see description in the text

                          Feature rank
No   Methods              1    2    3    4    5    6    7    8    9    10   11   12   13   14
1    Gain Ratio (GR)      1    5    2    4    61   44   42   59   8    6    3    60   43   31
2    ReliefF (RF)         1    5    2    4    42   59   61   44   8    6    11   32   10   31

3. Experiments and Discussion

After the feature data had been collected and extracted, we randomly partitioned the data into training and testing data sets. The training feature data were ranked in descending order using ReliefF and Gain Ratio. The ranking of the features obtained for the training data is presented in Table 3. The first 10 features are consistently ranked at the top by both methods. These features are fundamental frequency, spectral flux, spectral centroid, spectral roll off 90%, cepstral coefficient 10th, filter bank energy 10th, filter bank energy 8th, cepstral coefficient 8th, spectral kurtosis, spectral skewness, and spectral roll off 40%. Four of the first ten features come from the ASR feature sets: cepstral coefficient 10th (CC 10th), filter bank energy 10th (FBE 10th), filter bank energy 8th, and cepstral coefficient 8th. Not surprisingly, the CC and FBE (8th and 10th) have similar ranks (bold and italic text in Table 3), since the CC is a spectral derivative computed from the FBEs.
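The agreement between the two rankings can be checked directly from the feature numbers reported in Table 3. The helper `top_n_agreement` below is a hypothetical name introduced for this illustration; the two lists are copied from the table.

```python
# Feature numbers in rank order, as listed in Table 3.
GAIN_RATIO = [1, 5, 2, 4, 61, 44, 42, 59, 8, 6, 3, 60, 43, 31]
RELIEFF = [1, 5, 2, 4, 42, 59, 61, 44, 8, 6, 11, 32, 10, 31]


def top_n_agreement(rank_a, rank_b, n):
    """Compare the first n entries of two rankings.

    Returns whether both rankings select the same set of features and the
    list of rank positions (1-based) at which they name the same feature.
    """
    a, b = rank_a[:n], rank_b[:n]
    same_set = set(a) == set(b)
    same_positions = [i + 1 for i in range(n) if a[i] == b[i]]
    return same_set, same_positions


same_set, same_positions = top_n_agreement(GAIN_RATIO, RELIEFF, 10)
```

For the first ten ranks the two methods select the same set of features and differ only in the order of ranks 5 to 8; beyond rank 10 the selections diverge.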

For each ranking method, the recognition accuracy on the testing data was investigated as a function of the number of ranked features. The recognition rate, or accuracy, was taken from the prediction accuracy of the support vector machine (SVM). The training process started with the least important feature and was repeated until all the features had been included. Accuracy as a function of the number of features is presented in Fig. 5. We measured the performance for subsets consisting of the n ranked features, where n varies between 1 and 68, starting from the least important features.

Figure 5 shows the degradation in the recognition rate when the number of features in the subset is reduced. A comparison of the two methods shows that the accuracies above 90% achieved with the GR subsets are better than those achieved with the RF subsets. Otherwise, both techniques show the same behavior without any significant differences. The accuracy remains almost the same until the subsets are reduced to 50 or fewer features, after which it tends to decrease as the feature subsets are reduced further.

Figure 5 Accuracy for the gamelan dataset as a function of n ranked features

4. Conclusion

In this paper, we have presented our approach to ranking ASR and spectral-based features using two filter-based ranking methods. We have investigated the use of FBEs and CC as features for gamelan instrument identification. The accuracy of the SVM classifier is significantly influenced by the feature ranking. The results show that the Gain Ratio (GR) technique gives slightly better results than the ReliefF (RF) technique. We have also compared the CC and FBE features with other spectral-based features and found them to be a good feature set for gamelan instrument identification.
