Music Classification Using Neural Networks Computer Science Essay


Abstract- The purpose of this project is to study the feasibility of a music classification system based on music content using a neural network. A 1.5-second audio file stored in WAV format is passed to a feature extraction function. The WAV format for digital audio is simply the left and right stereo signal samples. The feature extraction function calculates 124 numerical features that characterize the sample. When training the system, this feature extraction process is performed on many different input WAV files to create a matrix of column feature vectors. This matrix is then preprocessed to reduce the number of inputs to the neural network and then sent to the neural network for training. After training, single column vectors can be fed to the preprocessing block, which processes them in the same manner as the training vectors, and then classified by the neural network.

1 Introduction

Neural networks have found profound success in the area of pattern recognition. By repeatedly showing a neural network inputs classified into groups, the network can be trained to discern the criteria used to classify, and it can do so in a generalized manner, allowing successful classification of new inputs not used during training. With the explosion of digital music in recent years due to Napster and the Internet, the application of pattern recognition technology to digital audio has become increasingly interesting. On the user end, many people have downloaded large collections of music files (e.g. MP3s and WAVs) that are often stored in directory structures classified by genre or artist. Thus one can imagine the usefulness of a program that would automatically classify and store newly downloaded music using the existing classification system set by the user. A second useful program would be one that searches through a collection of files and extracts only those with characteristics chosen by the user. For instance, a user may want to search through a library of files stored on a computer in Austria for those that are of the classical music genre, but due to a language difference and the Austrian user's own preferences for file naming, determining the genre of each of the files may be very difficult to do using just file names.

Thus a program that makes classifications based on music content would be much more appropriate and useful. On Napster's and the recording industry's ends, classification of music based on content is necessary for ensuring that copyrighted music is not freely distributed across the Internet. Filters based on file names have been found to be very ineffective, for clever users simply alter the names of the files to circumvent such filters. What is needed is a classification system that looks only at the content of the file to make its classification decisions. Such a system would be much more effective, since altering the content of the file is not a very appealing option to users. Figure 1 below is a block diagram of the classification system.

2 System Setup

This section describes the setup of the digital audio classification system. This system is composed primarily of the blocks shown in Figure 1 and was developed in the MATLAB environment.

2.1 Input Files

Data for training and testing the system was taken from ten compact discs: four classified as rock (labeled R01-R04), two classified as classical (C01 and C02), two classified as soul or R&B (S01 and S02), and two classified as country and western (W01 and W02).

The four rock CDs were recorded by four different artists. A complete source listing for these CDs can be found in Appendix A. The tracks on each of these CDs were extracted, converted to WAV format, and then divided into segments of length 2^18 samples, or about six seconds. To avoid periods within the music not characteristic of the whole song, the segments were all taken from the middle of each track. From this procedure, 2,781 segments of music were produced. Each segment was then further divided into two sub-segments by extracting the first 2^16 samples (about 1.5 seconds) and the third 2^16 samples. Thus, in total, 5,562 sub-segments of music were generated to use for training and testing the system. For classification by genre, CDs R01, R02, C01, C02, S01, S02, W01, and W02 were used. For classification by artist, the four rock CDs were used.
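The segment arithmetic above can be checked with a short sketch. It assumes the segment lengths are counted in samples at the 44.1 kHz CD sampling rate; the helper names are illustrative and not taken from the original system.

```python
SAMPLE_RATE_HZ = 44_100      # CD sampling rate (assumed for all input files)
SEGMENT_LEN = 2 ** 18        # samples per segment
SUB_SEGMENT_LEN = 2 ** 16    # samples per sub-segment

def duration_seconds(num_samples, sample_rate=SAMPLE_RATE_HZ):
    """Length of a run of samples in seconds."""
    return num_samples / sample_rate

def sub_segments(segment):
    """Return the first and third quarters of a segment."""
    quarter = len(segment) // 4
    return segment[0:quarter], segment[2 * quarter:3 * quarter]

segment_seconds = duration_seconds(SEGMENT_LEN)          # just under six seconds
sub_segment_seconds = duration_seconds(SUB_SEGMENT_LEN)  # just under 1.5 seconds
total_sub_segments = 2 * 2781                            # two sub-segments per segment
```

Taking the first and third quarters leaves a gap between the two sub-segments, so the two feature vectors from one segment do not share any samples.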

2.2 Feature Extraction

Ideally, all the samples in the WAV file would be passed to the neural network, and then the neural network would determine the best way to process the data to arrive at a classification of the file. However, at a sampling rate of 44.1 kHz, even a one-second sample of audio would result in a prohibitive amount of information for the neural network and MATLAB. Therefore, a feature extraction function is needed to reduce the amount of data passed to the neural network. Extracting useful features from a digital audio sample is an evolving science and remains a popular research field. From the unlimited number of calculations that could be performed, this system uses only 124. These features fall into six categories described below. Table 1 outlines the format of the feature vector.

2.2.1 Linear Predictive Coding Taps

In linear predictive coding (LPC), a signal is modeled by the following equation:

$$y_{n+1} = w_0 y_n + w_1 y_{n-1} + w_2 y_{n-2} + \dots + w_{L-1} y_{n-L+1} + e_{n+1}$$

The goal of this model is to predict the next sample of the signal by linearly combining the L most recent samples while minimizing the mean squared error over the entire signal. The weights (the w_i) are determined by using an adaptive filter and the LMS algorithm. For this system, the music segments were modeled using 32 taps (L=32). A block diagram of the adaptive filter used is shown below in Figure 2.
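A minimal sketch of how LMS can adapt the prediction taps is given here. The step size and the sinusoidal test signal are assumptions chosen for illustration; the original system's step size and filter details are not stated.

```python
import math

def lms_taps(signal, num_taps=32, step_size=0.01):
    """Adapt linear-prediction weights with the LMS update rule."""
    weights = [0.0] * num_taps
    for n in range(num_taps - 1, len(signal) - 1):
        # Most recent sample first: y[n], y[n-1], ..., y[n-L+1].
        recent = [signal[n - k] for k in range(num_taps)]
        prediction = sum(w * y for w, y in zip(weights, recent))
        error = signal[n + 1] - prediction
        # LMS: move each weight along the instantaneous error gradient.
        for k in range(num_taps):
            weights[k] += step_size * error * recent[k]
    return weights

# A pure sinusoid obeys y[n+1] = 2*cos(w)*y[n] - y[n-1], so two taps suffice.
sine = [math.sin(n) for n in range(4000)]
two_taps = lms_taps(sine, num_taps=2, step_size=0.1)
```

For the sinusoid the two weights settle near 2*cos(1) and -1, which is the exact recurrence for that signal. Real music is not perfectly predictable, so its taps only approximate the minimum mean squared error solution.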

To reduce the time required to calculate the LPC taps, the code was written in C and compiled using the MATLAB MEX compiler, which resulted in a very significant decrease in execution time.

2.2.2 Frequency Content

Frequency content was found to be an important feature for classifying music. Three different frequency content calculations were performed and included in the feature vectors. The first frequency content features calculated were the amplitude values of the discrete Fourier transform (DFT) of the signal. Because the sampling rate for the WAV files was 44.1 kHz, the DFT of the audio sample shows frequency content only up to about 22 kHz. Initial analysis of the audio signals being tested revealed that the vast majority of the spectral power lies in the lower portion of this spectrum; therefore, the signals were downsampled by a factor of two (T=2) before taking the DFT to effectively zoom in on the lower half of the spectrum. The positive-frequency values of the DFT spectrum were then grouped into 32 evenly spaced bins, and the average spectral energies in each of the bins were reported as 32 features. The second calculation was to take the natural logarithm of the 32 DFT amplitude values and report these values as 32 additional features. These features emphasize the differences in the values at frequencies with very small DFT amplitude values, which are mostly the higher frequencies. These features are provided to distinguish different samples by their higher frequency content. The final calculation was to take the inverse DFT of the logarithm of the amplitude of the DFT values. The lower 12 values of this calculation were reported as 12 more features and were included to further emphasize the higher frequencies of the samples.
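The three frequency calculations can be sketched as follows. Several details are assumptions: the downsampling keeps every second sample with no anti-alias filter, the bins average amplitudes rather than squared amplitudes, and the inverse DFT is applied to the 32 binned log values. A direct O(N^2) DFT is used only to keep the example self-contained.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Amplitudes of the non-negative-frequency half of the DFT."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(samples)))
        for k in range(n // 2)
    ]

def bin_averages(values, num_bins):
    """Average the values over evenly sized, contiguous bins."""
    size = len(values) // num_bins
    return [sum(values[i * size:(i + 1) * size]) / size for i in range(num_bins)]

def inverse_dft_real(values):
    """Real part of the inverse DFT of a sequence."""
    n = len(values)
    return [
        sum(v * cmath.exp(2j * math.pi * k * m / n)
            for k, v in enumerate(values)).real / n
        for m in range(n)
    ]

def frequency_features(samples, num_bins=32, num_cepstral=12):
    decimated = samples[::2]                              # keep every second sample
    binned = bin_averages(dft_magnitudes(decimated), num_bins)
    log_binned = [math.log(v + 1e-12) for v in binned]    # small offset avoids log(0)
    cepstral = inverse_dft_real(log_binned)[:num_cepstral]
    return binned + log_binned + cepstral                 # 32 + 32 + 12 values

# Test tone: 8 cycles over 512 samples.
tone = [math.sin(2 * math.pi * 8 * t / 512) for t in range(512)]
features = frequency_features(tone)
```

For the test tone, all of the spectral amplitude falls into the third bin, and the corresponding log feature is the natural logarithm of that bin average.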

2.2.3 Mel-Frequency Cepstral Coefficients

Mel-Frequency Cepstral Coefficients (MFCCs) have been used very successfully in the field of speech recognition as classification features for speech audio signals. The processing sequence for finding the MFCCs of an audio signal is the following:

1. Window the data with a Hamming window.

2. Find the amplitude values of the DFT of the data.

3. Convert the amplitude values to filter bank outputs.

4. Calculate the log base 10.

5. Find the cosine transform.

The filter bank consists of 40 triangular filters, with 13 spaced linearly by 133.33 Hz and 27 spaced logarithmically by a factor of 1.0711703 in frequency. The DFT amplitude values are combined using these triangular filters to form the filter bank outputs. Code developed by Malcolm Slaney as a part of his Auditory Toolbox was used to calculate the MFCC values. Fifteen MFCC values were reported as features and included in the feature vector [3].
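The last two steps of the MFCC sequence (log base 10, then cosine transform) can be sketched as below. This is not the Auditory Toolbox code: the unnormalized DCT-II and the choice to keep the first 15 coefficients starting at index 0 are assumptions, and the filter bank outputs are taken as given.

```python
import math

def mfcc_from_filter_bank(filter_bank_outputs, num_coeffs=15):
    """Log-compress filter bank outputs, then apply a DCT-II."""
    # Small offset avoids log10(0) for empty bands.
    log_outputs = [math.log10(v + 1e-12) for v in filter_bank_outputs]
    m = len(log_outputs)
    return [
        sum(x * math.cos(math.pi * k * (j + 0.5) / m)
            for j, x in enumerate(log_outputs))
        for k in range(num_coeffs)
    ]

# A perfectly flat spectrum across 40 filters.
flat = mfcc_from_filter_bank([10.0] * 40)
```

With a flat input, only the zeroth coefficient is non-zero. The higher coefficients respond only to variation across the filter bank, which is why they are useful as descriptors of spectral shape.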

2.2.4 Volume

The volume of a musical piece is easily calculated as the variance of the samples.
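As a small illustration, the volume feature and the overall feature count can be written as follows. The population variance (dividing by N) is an assumption, since the text does not say which normalization was used, and the dictionary keys are illustrative labels rather than the layout of Table 1.

```python
def volume(samples):
    """Variance of the samples around their mean (population form)."""
    n = len(samples)
    mean = sum(samples) / n
    return sum((s - mean) ** 2 for s in samples) / n

# Number of features contributed by each calculation described in Section 2.2.
FEATURE_COUNTS = {
    "lpc_taps": 32,
    "dft_amplitudes": 32,
    "log_dft_amplitudes": 32,
    "inverse_dft_of_log": 12,
    "mfcc": 15,
    "volume": 1,
}
total_features = sum(FEATURE_COUNTS.values())
```

The six counts add up to the 124 features quoted for the feature vector.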

2.3 Data Preprocessing

The feature vectors returned by the feature extraction block were first preprocessed before inputting them to the neural network. Two types of preprocessing were performed, one to scale the data to fall within the range of -1 to 1 and one to reduce the length of the input vector. The data was divided into three sets, one for training, one for validation, and one for testing. The preprocessing parameters were determined using the matrix containing all feature vectors used for training and validation. For testing, these same parameters were used to preprocess test feature vectors before passing them to the trained neural network. The first preprocessing function used was premnmx, which preprocesses the data so that the minimum and maximum of each feature across all training and validation feature vectors is -1 and 1. Premnmx returns two parameters, minp and maxp, which were used with the function tramnmx for preprocessing the test feature vectors.
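The scaling step can be sketched in the following way. The function names are hypothetical stand-ins for the roles played by premnmx and tramnmx, not reimplementations of those MATLAB functions. The point shown is that the minimum and maximum come only from the training and validation vectors and are then reused unchanged on test vectors.

```python
def fit_min_max(vectors):
    """Per-feature minimum and maximum over a list of feature vectors."""
    num_features = len(vectors[0])
    mins = [min(v[i] for v in vectors) for i in range(num_features)]
    maxs = [max(v[i] for v in vectors) for i in range(num_features)]
    return mins, maxs

def apply_min_max(vector, mins, maxs):
    """Map each feature linearly so that [min, max] becomes [-1, 1]."""
    return [
        2.0 * (x - lo) / (hi - lo) - 1.0 if hi > lo else 0.0
        for x, lo, hi in zip(vector, mins, maxs)
    ]

train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
minp, maxp = fit_min_max(train)
scaled_train = [apply_min_max(v, minp, maxp) for v in train]
scaled_test = apply_min_max([20.0, 10.0], minp, maxp)   # reuses training parameters
```

A test vector can fall outside the range of -1 to 1 when one of its features lies outside the range seen during training, as the first feature of the test vector does here.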

The second preprocessing function used was prepca, which performs principal component analysis on the training and validation feature vectors. Principal component analysis is used to reduce the dimensionality of the feature vectors from a length of 124 to a length more manageable by the neural network. It does this by orthogonalizing the features across all feature vectors, ordering the resulting components so that those with the most variation come first, and then removing those that contribute least to the variation [4]. Prepca was used with a value of 0.001, so that components contributing less than 0.1% of the total variation were discarded. This procedure reduced the length of the feature vectors by one half. Prepca returns the matrix transMat, which is used with the function trapca to apply the same principal component transformation to the test feature vectors as was applied to the training and validation feature vectors. This was done before passing the test feature vectors to the trained neural network.
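The selection step of principal component analysis can be illustrated on its own. The sketch assumes the component variances have already been computed and that the threshold is applied to each component as a fraction of the total variance. The function name and the example variances are made up for illustration.

```python
def components_to_keep(variances, min_frac=0.001):
    """Indices of components whose share of total variance is at least min_frac,
    ordered from largest variance to smallest."""
    total = sum(variances)
    order = sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)
    return [i for i in order if variances[i] / total >= min_frac]

# Hypothetical component variances summing to 100.
example_variances = [30.0, 0.04, 50.0, 19.95, 0.01]
kept = components_to_keep(example_variances)
```

In this example the two smallest components each carry less than 0.1% of the total variance and are dropped, leaving three of the five.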

2.4 Neural Network

A three-layer feedforward backpropagation neural network, shown in Figure 3, was used for classifying the feature vectors. By trial and error, an architecture consisting of 20 adalines in the input layer, 10 adalines in the middle layer, and 3 adalines in the output layer was found to provide good performance. The transfer function used for all adalines was a tangent sigmoid, 'tansig'. The Levenberg-Marquardt backpropagation algorithm, 'trainlm', was used to train the neural network.

Figure 3
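A forward pass through a network of this shape can be sketched as follows. The weights here are random placeholders rather than trained values, the input length of 62 is an assumption based on the halving of the 124-element feature vector, and math.tanh stands in for the tangent sigmoid transfer function.

```python
import math
import random

def make_layer(num_inputs, num_neurons, rng):
    """One fully connected layer with small random weights and zero biases."""
    return {
        "weights": [[rng.uniform(-0.1, 0.1) for _ in range(num_inputs)]
                    for _ in range(num_neurons)],
        "biases": [0.0] * num_neurons,
    }

def forward(layers, inputs):
    """Propagate an input vector through every layer using tanh activations."""
    activations = inputs
    for layer in layers:
        activations = [
            math.tanh(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(layer["weights"], layer["biases"])
        ]
    return activations

rng = random.Random(0)
NUM_INPUTS = 62   # assumed length of a preprocessed feature vector
network = [make_layer(NUM_INPUTS, 20, rng),
           make_layer(20, 10, rng),
           make_layer(10, 3, rng)]
output = forward(network, [rng.uniform(-1.0, 1.0) for _ in range(NUM_INPUTS)])
```

Because the output layer also uses a tanh-shaped transfer function, every output coordinate lies between -1 and 1, which matches the range used for the classification targets.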

2.5 Classification Vectors

Two music classification systems were implemented and tested, one to classify by genre and one to classify by artist. Figure 4 shows the constellations used for each of these classification systems, and Table 2 lists the specific coordinates of the constellation for each classification scheme. The constellations were chosen so that all points were equidistant from each other, all coordinates were within the -1 to 1 range, and the distance between points was maximized. Originally a two-dimensional constellation was used, but the increased distance between points gained by moving to three dimensions provided a significant performance increase. Constellations of dimension greater than three did not provide a significant enough performance increase to justify the added computational complexity.
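One set of four points that satisfies these constraints is the set of alternating corners of the cube with coordinates of plus or minus 1, which form a regular tetrahedron. The coordinates below are an illustrative assumption and are not taken from Table 2. The square is included only to show the kind of distance gained by moving from two to three dimensions.

```python
import math
from itertools import combinations

# Alternating cube corners: every pair of points is the same distance apart.
TETRAHEDRON = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]

# Four points in two dimensions cannot all be equidistant; a square is one option.
SQUARE = [(1, 1), (1, -1), (-1, 1), (-1, -1)]

def pairwise_distances(points):
    """Euclidean distance between every pair of points."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]

tetra_distances = pairwise_distances(TETRAHEDRON)
min_square_distance = min(pairwise_distances(SQUARE))
```

All six tetrahedron distances equal 2*sqrt(2), about 2.83, while the closest points of the square are only 2 apart.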

3. Results

This section will discuss the results of training and testing the classification system. Two separate results will be presented, one for classification by genre and one for classification by artist.

3.1 Classification by Genre

To test the performance of the music classification system, the system was first configured to classify music by genre. The four genres used were rock, classical, soul/R&B, and country and western. The first step in performing this test was to generate the data set. As discussed above, the data set was taken from eight CDs, two per genre, and consisted of 4,425 feature vectors. From these 4,425 feature vectors, 2,213 were used for training and the other 2,212 were reserved for testing. Before training, data preprocessing was performed on the training data, as discussed above. After preprocessing, the training data was divided further into two groups, one for training and one for validation. A validation data set was needed to ensure that the neural network did not overfit the data. The next step was to create the neural network discussed above in the system setup section. The training function used was the Levenberg-Marquardt backpropagation algorithm, 'trainlm'. The parameters mu, mu_dec, and mu_inc of 'trainlm' were set to 1, 0.8, and 1.5 in order to ensure that the algorithm did not converge too quickly, which helped to limit the amount of overfitting that occurred before a validation stop of the training. Figure 5 below shows the MSE versus training epoch plot; both the training data MSE and validation data MSE curves are shown. The MSE reached 0.0228 before a validation stop occurred.
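The idea of a validation stop can be sketched as follows. The stopping criterion used here (stop once the validation MSE has failed to improve for max_fail consecutive epochs) and the value of max_fail are assumptions. The text does not give the exact rule that was configured, and the MSE values are made up for illustration.

```python
def validation_stop(validation_mse, max_fail=5):
    """Return (best_epoch, best_mse), scanning epochs until the validation
    error has failed to improve max_fail times in a row."""
    best_mse = float("inf")
    best_epoch = 0
    fails = 0
    for epoch, mse in enumerate(validation_mse):
        if mse < best_mse:
            best_mse, best_epoch, fails = mse, epoch, 0
        else:
            fails += 1
            if fails >= max_fail:
                break   # validation stop
    return best_epoch, best_mse

# Hypothetical validation curve: improves, then drifts upward as overfitting sets in.
curve = [0.5, 0.3, 0.2, 0.21, 0.22, 0.25, 0.1]
stopped_early = validation_stop(curve, max_fail=3)
ran_to_end = validation_stop(curve, max_fail=5)
```

With a small max_fail the scan stops before reaching the late improvement in the curve, which illustrates the trade-off between stopping early to limit overfitting and stopping too soon.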

After training, the system was then tested using the data set reserved for testing. Before passing the test feature vectors to the trained neural network, data preprocessing was performed using the saved parameters from the preprocessing of the training data. The results are summarized in Tables 3 and 4. Figure 6 shows a three-dimensional plot of the output vectors of the neural network for each of the test input vectors. The decision rule used for classifying the output of the neural network was a minimum distance rule. A decision was made by first calculating the distance from the output of the neural network to each of the constellation points and then choosing the constellation point that produced the minimum distance.
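The minimum distance rule can be written in a few lines. The constellation coordinates and the assignment of genres to points are hypothetical, since Table 2 is not reproduced here.

```python
import math

# Hypothetical assignment of genres to constellation points.
CONSTELLATION = {
    "rock": (1, 1, 1),
    "classical": (1, -1, -1),
    "soul/R&B": (-1, 1, -1),
    "country and western": (-1, -1, 1),
}

def classify(output, constellation=CONSTELLATION):
    """Pick the label whose constellation point is closest to the network output."""
    return min(constellation, key=lambda label: math.dist(output, constellation[label]))

near_rock = classify((0.9, 0.8, 0.7))
near_country = classify((-0.2, -0.9, 0.4))
```

Each network output is assigned to exactly one class, so the rule divides the output space into one region per constellation point.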

Genre classification was performed at a success rate of 94.8%, with classical music being classified the most successfully, at 96.7%, and country and western, soul/R&B, and rock music being classified the least successfully, at success rates of 91.0%, 93.1%, and 93.3%, respectively. The separation of success rates between classical music and the other three genres was expected since the four genres are not equally distinct in style. Classical music is definitely the genre that stands out as being the most distinct among the four genres, while country and western, rock, and soul/R&B can be grouped as musical genres of a somewhat similar style. Country and western, rock, and soul/R&B have each influenced one another throughout their growth into separate musical genres, and thus one would expect several features of each genre to be mimicked in the other two. Furthermore, out of the three non-classical music genres, country and western music was the genre that was classified incorrectly as classical music the most. This was also an expected result, since country and western music features instruments that are the most similar to those used in classical music (i.e., stringed instruments such as the acoustic guitar and violin).

3.2 Classification by Artist

To further test the music classification system, the system was configured to classify music by artist. Four rock artists were used, referred to here as R01, R02, R03, and R04. Data for this test was taken from the four rock CDs, which are listed in Appendix A. The training and testing of this system were performed identically to the training and testing of the system for classifying by genre. From the four CDs, 2,187 feature vectors were extracted and split into two approximately equal groups, one for training and one for testing. The training data set was then further divided to form the training and validation data sets. Training was performed using the same preprocessing, training function, and parameters as described in the classification by genre section. Figure 7 below shows the MSE versus training epoch plot; both the training data MSE and validation data MSE curves are shown. The MSE reached 1.81e-5 before a validation stop occurred. By comparing Figures 5 and 7, it is evident that more overfitting occurred when training the system to classify by artist, which is discussed further below.

After training, the system was tested using the feature vectors reserved for testing. The results are summarized in Tables 5 and 6, and Figure 8 shows a three-dimensional plot of the output vectors of the neural network for each of the test input vectors.

4.1 More Advanced Feature Extraction

The field of music feature extraction is a rich research area, for improving feature extraction will most likely have the largest impact on the performance of a music classification system. For the system detailed in this paper, feature vectors were extracted from 1.5-second music samples, and although the system performed well, 1.5 seconds does not capture all the characteristics of an entire song. What is needed is a feature extraction method that looks at more of the song in an attempt to not only capture "short-time" features but also "long-time" features that describe how the song evolves over time. One way to implement this is to simply use entire songs as the input to the feature extractor, but at high sampling rates, this leads to a prohibitively large amount of data for the feature extractor to process. A second approach would be to send several small samples, such as 1.5-second samples, that are equally spaced throughout a song to the feature extractor. The feature extractor could then extract "short-time" features from each of the samples and then produce "long-time" features by examining how the extracted "short-time" features evolve with time. A feature extractor that considers an entire song would be a start towards developing a more advanced feature extractor, but even more needs to be done. Probably the toughest problem that needs to be solved is how to extract features that describe the very personal performance style of a musical piece. These are the features that will be necessary for making correct decisions when the differences in the pieces of music are very subtle, such as occurs when classifying music by artists within the same genre.
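The second approach can be sketched as below. The functions are hypothetical: the first places equally spaced sample windows across a song, and the second summarizes how each short-time feature varies across those windows using its mean and variance. These are only one simple choice of "long-time" descriptor.

```python
def segment_starts(total_samples, segment_len, num_segments):
    """Start indices of equally spaced segments spanning the whole song."""
    if num_segments == 1:
        return [(total_samples - segment_len) // 2]
    step = (total_samples - segment_len) / (num_segments - 1)
    return [round(i * step) for i in range(num_segments)]

def long_time_features(short_time_vectors):
    """Per-feature mean followed by per-feature variance across segments."""
    count = len(short_time_vectors)
    num_features = len(short_time_vectors[0])
    means = [sum(v[i] for v in short_time_vectors) / count
             for i in range(num_features)]
    variances = [sum((v[i] - means[i]) ** 2 for v in short_time_vectors) / count
                 for i in range(num_features)]
    return means + variances

starts = segment_starts(total_samples=100, segment_len=10, num_segments=4)
summary = long_time_features([[1.0, 2.0], [3.0, 2.0]])
```

A feature that stays constant across the song contributes zero variance, while a feature that changes between segments shows up in the variance half of the summary.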

4.2 More Advanced Decision Rule

The decision rule used in this system assumes that the noise that drives the output of the neural network away from constellation points is equal among all classification categories. From the results of the classification by genre section (Figure 6), this assumption is clearly not accurate. A more advanced decision rule that partitions the output space into classification regions in a cleverer manner would likely lead to better results. The main approach to implementing this is to observe the outputs while training and to assign larger regions to the classification groups experiencing the most noise or deviation. By providing more room for error for the noisier classification groups, the error rate should be driven closer to zero and better balanced among all groups.
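One possible way to give noisier classes larger regions is to divide each distance by a per-class spread estimated from the training outputs. This is a sketch of that idea only, not a rule proposed in the text. The two-point constellation and the spread values are made up for illustration.

```python
import math

def estimate_spread(outputs, target):
    """Root mean squared distance of a class's training outputs from its target."""
    return math.sqrt(sum(math.dist(o, target) ** 2 for o in outputs) / len(outputs))

def classify_scaled(output, constellation, spreads):
    """Minimum distance rule with each distance divided by the class spread."""
    return min(
        constellation,
        key=lambda label: math.dist(output, constellation[label]) / spreads[label],
    )

constellation = {"a": (-1.0, 0.0, 0.0), "b": (1.0, 0.0, 0.0)}
spreads = {"a": 1.0, "b": 3.0}      # class "b" is assumed to be three times noisier
point = (-0.2, 0.0, 0.0)

scaled_choice = classify_scaled(point, constellation, spreads)
plain_choice = classify_scaled(point, constellation, {"a": 1.0, "b": 1.0})
unit_spread = estimate_spread([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], (1.0, 0.0, 0.0))
```

The example point is closer to "a" in plain distance, but it is assigned to "b" once the larger spread of "b" is taken into account.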

4.3 MP3 Files instead of WAV Files

Given the popularity of the MP3 format for digital audio, a system that would take MP3 files as input instead of WAV files is desired. The system presented in this paper can be easily converted to take MP3 files as input by prepending an MP3-to-WAV converter to the system in Figure 1. This approach is valid and may be the best choice, but currently converting MP3 files to WAV files is a computationally intensive procedure that requires a significant amount of execution time. However, as computer performance continues to advance, this problem will become negligible. An alternate approach is to design a system that works exclusively with MP3 files, that is, one that extracts features directly from files in the MP3 format. The drawback of this approach is that new methods for extracting features from highly compressed data would have to be researched, and most of the current feature extraction research would become irrelevant. However, highly compressed data may contain valuable features not obvious in the uncompressed data, making such research worthwhile. This leads to the idea of creating a hybrid system that extracts features from both the WAV and MP3 versions of the file, thus using the best of both worlds.

4.4 More Classification Categories

To make a music classification tool useful, the number of classification categories needs to be increased to more than four. Implementing this improvement would involve work in several different areas. For instance, one would need to find a way of determining the constellation dimensionality needed to provide enough distance between points for acceptable system performance. Another area that will need work is feature extraction, for more advanced feature extraction may be necessary to give the neural network enough information to discern more than four classifications. Also, a more advanced decision rule will be needed to provide a clever partitioning strategy of the output space so that categories experiencing more noise will be given more room for error. One alternate approach to increasing the number of categories is to set up a categorization system in the form of a tree structure. If each node in the tree has a maximum of four children, then the four-category classification system presented in this paper could be used to move down the tree from node to node until a category at the bottom of the tree is reached. Such a system would require a separate trained neural network for each node in the tree, but it would avoid many of the issues discussed above involved in implementing a flat categorization system.
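The tree-structured approach can be sketched with a small traversal routine. The per-node classifiers below are stand-in functions that look at a single feature. In the scheme described above, each would be a separately trained neural network returning the index of one of up to four children. The category labels are illustrative.

```python
def leaf(label):
    """A bottom-level category."""
    return {"label": label}

def classify_tree(node, features):
    """Walk down the tree, letting each node's classifier choose a child."""
    while "children" in node:
        child_index = node["classifier"](features)
        node = node["children"][child_index]
    return node["label"]

# Hypothetical two-level tree with placeholder classifiers.
tree = {
    "classifier": lambda f: 0 if f[0] < 0 else 1,
    "children": [
        leaf("classical"),
        {
            "classifier": lambda f: 0 if f[1] < 0 else 1,
            "children": [leaf("rock"), leaf("country and western")],
        },
    ],
}

first = classify_tree(tree, [-1.0, 0.0])
second = classify_tree(tree, [1.0, -1.0])
third = classify_tree(tree, [1.0, 1.0])
```

Each classification requires only one network evaluation per level of the tree, so the cost grows with the depth of the tree rather than with the total number of categories.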