Recordings Of The Vocalisation Of Birds Biology Essay


This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

A large number of biological studies gather field recordings of bird vocalisations for technical research [Kogan], and many amateurs and enthusiasts undertake similar activities as a hobby. Many of these recordings are analysed to produce spectrograms that are used to identify birds or species' calls and to monitor the avian population of an ecosystem. Research of this kind is generally conducted in natural environments, which makes locating and monitoring the birds technically difficult. Many of these difficulties can be overcome by detecting and recording the vocalisations the birds produce, which are subsequently analysed by experts. Once the data has been analysed, it can be used to determine specific information about a species or how individuals interact within a colony. Bird vocalisation is therefore a suitable basis for environmental monitoring, ecological censusing, biodiversity assessment, etc. [Lee].

Current techniques for obtaining this data require a device to be placed in situ in a test area or near a specimen to be monitored. Alternatively, an individual or a group attends and conducts a census on location. Both methods are widely used, but both have limitations and drawbacks. For instance, if a recording is to be taken, the device has to be taken to and retrieved from the site, which can be a time-consuming process. An expert then has to listen to the data, segment it either manually or automatically, and make observations, which can be extremely time consuming. Alternatively, if the birds are monitored on site, some birds may be less inclined to vocalise when humans are present [Douwe], and the census requires trained experts who are able to identify the birds. Both techniques are long and arduous, with many hours dedicated to listening to and deciphering the vocalisations, and both require experts who can identify different species. Manual inspection is prone to errors, and the cross-checking this requires duplicates the work and effort [Kogan]. All of this inevitably leads to cost. With experts charging a premium, there is a need to reduce both the time and the cost of these operations. There is therefore a need for an automated system that can carry out this analysis with greater speed, reduced cost and a greater degree of accuracy than current procedures.

Previous work towards a solution has centred on the techniques for automated speech and speaker recognition that are used in human research. Bird vocalisation recognition is a typical pattern-processing problem, with signal pre-processing, feature extraction and classification stages [Fagerlund], so the recognition problem is comparable to that of human speech. Human vocalisations consist of subunits organised in stereotyped hierarchies: sentences, words and letters. The same applies to birds, with their songs, syllables and notes [Catchpole and Slater, 1995] [Kogan]. Because bird vocalisation is relatively simple compared with human speech, the techniques used for automated human speech recognition can be applied to the recognition of birds as well. So far, however, comparatively little work has been done on developing software that is able to perform this sort of recognition on animals. Most of the work that has been carried out in this field has used clean recordings produced in a controlled environment, and has therefore not focused on providing a tool that is able to automatically detect bird songs in a real-world environment. [Diplomarbeit]

It is therefore necessary for research to be carried out to identify and create a technology that incorporates known techniques for human speech recognition and develops them into a working system for birds that is capable of processing real-time data. A spectrum of ideas and approaches has been applied to research in this area. However, much of this research has focused on the possibility of a technique being used for bird recognition rather than on actually applying it to the problem. Much of it has centred on the use of GMM and HMM classifiers, and a commercial application called Songscope has previously been produced using an HMM classifier (use Douwe). Previous work has also made use of DTW (list who), ANN, which has shown promising results (quotes), and SVM (list who). The difficulty with this research is that it is hard to determine which method is in fact the more reliable option. With sample quality varying and no standardised data set or standard test data in use, it is difficult to compare the effectiveness of each piece of work.

This thesis will therefore try to develop (Needs finishing but needs to speak to Phil)

Additional commercial concepts

This section looks at possible applications for which a developed technology could be used.

Conservation and entertainment

Although it is not always appreciated, birds are an integral part of the ecosystem. They serve many purposes, including the distribution of seeds, rodent and insect control, and acting as a food source for birds of prey. Being able to monitor populations allows experts to help maintain a biodiverse environment. There is therefore a demand for a product that enables this work to be done efficiently and provides a service that does not require an excessive outlay.

As well as the experts, there are many groups and individuals who track and identify birds out of a passionate interest. Software of this kind would aid them in their quest for understanding and enjoyment.

Air Industry

With a large number of bird collisions each year, the cost of repairs to damaged aircraft in North America alone is around two billion pounds per year; more importantly, collisions with birds sometimes result in loss of life. [4] Migrating and resident birds cause a growing number of collisions each year, and a large proportion of these occur during takeoff and landing, so an automated early warning system that detects large birds in the area may save cost and lives. With more airports opening every year and air travel steadily increasing, this is a problem that may get progressively worse.

Commercial growth

Although birds are present in our cities, they are affected, like other animals, by the ever-expanding human population and its need to expand, which causes loss of habitat and of other essential requirements for avian survival. Humans are becoming increasingly aware of the destruction caused by building factories and business parks and by expanding cities into areas that were once natural environments. Controlling bodies have used several methods to manage these adverse effects. Many of these are studies that survey the damage such projects do to the local ecosystem, which enables areas of natural importance to be preserved. There is evidently a market for a product that is able to carry out this process automatically. [Diplomarbeit]

Wind turbines

Bird impacts with wind turbines have recently been reported in the press. Wind turbines are increasing in number and are seen as an ecological way to produce the energy of the future. Possibly because this technology is new, more attention is given to the birds they kill. However, many collisions occur whilst birds are migrating in severe weather conditions, when they are unable to locate the tower or become attracted to its light. There is therefore a possibility for a product that is able to detect approaching birds by their calls and make the turbines less attractive by adjusting the tower's colour or turning the turbine off completely.

Detection Goals

As the algorithm being produced is aimed at the area of biological censusing and biodiversity assessment, its output should represent the intended goal. The previously mentioned applications would require different levels of species or individual recognition, and each produces its own unique set of requirements. To successfully produce a detector for a given situation, the following information has to be considered. (Wolffe)

Definition of the elementary detection subject(s): To be able to detect groups or individuals.

Specification of the detection accuracy: The number of false positives, the ability to determine some/all songs from a bird.

Quantification of the detection classes: The ability to determine groups or individual birds. (Wolffe)

Bird vocalisation

Birds produce sounds for various reasons, with the majority falling into the categories of songs and calls (Krebs & Kroodsma 1980, taken from Fagerlund 2004). Songs are generally longer than calls and are more musical and harmonic; they are generally sung to attract mates or to define territories. Calls are generally shorter and are used to alert others to impending dangers, including predators. However, not all birds are songbirds, and only around 50% are able to produce songs. The remainder are able to produce calls that enable them to communicate with others. (Beckers, Suthers & ten Cate 2003, taken from Fagerlund 2004) Songbirds are generally able to produce more complex sounds because they have better control over sound production, which gives them a larger repertoire. (Gaunt 1983, Fagerlund 2004). Birds, like humans, have a clear structure to the sounds they produce, which can be subdivided into phrases, syllables and elements. (Douwe)

The bird uses its lungs, bronchi, syrinx, trachea, larynx, mouth and beak, or a combination of these body parts, to produce a sound. Air passing from the bronchi to the ... is the main process used to generate the sound. The sound that is generated by the syrinx is then modulated by the vocal tract. [Fagerlund 2004]

Unsupervised Detection

Bioacoustic monitoring is a functional tool for evaluating the bird population. [Wolffe] There are, however, complications that have to be taken into consideration when applying speech or speaker recognition technology to bird song recognition. Collecting data samples from humans, for example, is far easier than collecting them from birds. The researcher is able to collect data in a controlled environment and tell the speaker when to speak, which yields recordings with little to no extended silence and a high signal-to-noise ratio (SNR). Compare this to the collection of bird vocalisations in a real-world situation. Samples have to be collected in the bird’s natural environment, where it may be an unrestricted distance away from the recording device, and where obstacles such as trees and foliage may interfere and cause echo and reverberation on the recordings. The recordings are further obscured by a large amount of background noise from other animals, other birds, human activity including planes and trains, and naturally occurring events like wind and rain. This leads to a sample that has a low SNR, and the recogniser may have difficulty recognising the bird’s species. It also leads to additional pre-processing requirements, with the signal needing to go through a process of extraction and normalisation before it can be used.

Another issue is that human speech is concentrated in a bandwidth of approximately 3 kHz, which is characteristic of humans. Bird sounds, however, vary significantly, with different species producing sounds in different frequency bands. These may occur anywhere in a large frequency range from 10 Hz to 10,000 Hz. (Fagerlund 2004) This variation makes detection difficult: some calls are short and narrowband with distinctive spectral features, whilst some songs may be long with complex spectral differences. Because of this variation it becomes exceedingly difficult to produce an algorithm that is able to successfully detect vocalisations across such a broad frequency range.

In addition, the vast amount of training data available for human speech recognition, collated from thousands of individuals, makes it easier to model individual variation. Much of the research carried out on bird vocalisation is limited to a much smaller amount of data, which makes it difficult to model the large variations found in many species. [aiasftv - songscope]

Taking this information into account, it is easy to understand the complications that bird detection faces compared to human voice recognition, and the difficulties in adapting an algorithm. Any adaptation requires an aggressive pre-processing procedure that prepares the recording for a classifier able to successfully determine a bird's species.


In the literature

Kogan and Margoliash

One of the earliest pieces of work in this field was conducted by Kogan and Margoliash [9]. The aim of the research was to produce DTW and HMM classifier models, using samples from two species of bird, the zebra finch and the indigo bunting. These two species were chosen because their vocalisations differ in the spectral bands they occupy. To construct the HMM classifier, the Hidden Markov Model Toolkit (HTK) was used.

For the DTW approach, the input data was compared with a set of pre-specified template patterns, and in this way the system labels and segments the input data. [Douwe] [9] To do this, a 3-dimensional lattice (i,j,k) of the frames of the input and those of the templates had to be organised. In the lattice, i indicates the index of the input frame, j indicates the index of the frame within a template and k indicates the index of the individual template (Douwe). A distance metric is used to calculate the distance between two multidimensional vectors: the input signal at time frame i and time frame j of template k. These distances are accumulated along the lattice, and this produces the dynamic time warp. The input is therefore used to try to find the optimal sequence of template patterns. A constraint was placed on the time-warping which allowed the DTW to warp time by a factor no greater than 2. This produces a method that accounts for variability in the temporal domain only, not in the spectral domain.
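The template comparison can be illustrated with a minimal Python sketch. It is not the implementation used by Kogan and Margoliash. It compares an input with one whole template at a time rather than searching for a sequence of templates, it assumes a Euclidean distance metric, and it approximates the factor-of-2 warping limit with a simple length-ratio check. The names dtw_distance and best_template are hypothetical.

```python
import math


def euclidean(a, b):
    """Distance between two feature vectors of equal length (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(frames, template, max_warp=2.0):
    """Accumulated DTW cost between an input and a single template.

    frames[i] is input frame i and template[j] is template frame j.
    """
    n, m = len(frames), len(template)
    # Crude stand-in for the warping constraint: reject pairs whose
    # lengths differ by more than the allowed factor.
    if n > max_warp * m or m > max_warp * n:
        return math.inf
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(frames[i - 1], template[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_template(frames, templates):
    """Return the name of the closest template k and all template scores."""
    scores = {name: dtw_distance(frames, t) for name, t in templates.items()}
    return min(scores, key=scores.get), scores
```

Because the cost is accumulated only along the time axes, a stretched copy of a template scores as well as the template itself, whereas a shift in the feature values is penalised in full. This is the sense in which temporal, but not spectral, variability is absorbed.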

For the HMM classifier, HTK was used. Because HTK was designed for human speech recognition, it had to be adapted for use with bird recognition. (Douwe 9) Left-to-right models with five states were mostly used to identify three different types of sound: songs, calls and cage noises. In each model, the outer two states were used to facilitate transitions between different models, whereas the middle three were used to emit the signal. A GMM was used to produce the emission probabilities.

The Baum-Welch algorithm was used to train the models, and the Viterbi algorithm, in a token-passing implementation, was used to determine the optimal state sequences. Training files were selected randomly, except that files containing particularly uncommon elements were specifically included. Care was taken, however, that the test samples included all vocalisation classes.
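As an illustration of how a Viterbi search recovers an optimal state sequence, the following Python sketch decodes a small left-to-right model. It is only a sketch under several assumptions: discrete emission probabilities stand in for the GMM emissions, only the emitting states are modelled, and the state names, symbols and probabilities used in the test are invented for illustration. It does not reproduce HTK's token-passing implementation.

```python
import math


def log_prob(p):
    """Natural log that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0 else -math.inf


def viterbi(obs, states, start, trans, emit):
    """Return the most likely state path and its log-probability.

    start[s], trans[p][s] and emit[s][symbol] are plain probabilities.
    Missing entries are treated as zero.
    """
    best = [{s: log_prob(start.get(s, 0.0)) + log_prob(emit[s].get(obs[0], 0.0))
             for s in states}]
    back = []
    for t in range(1, len(obs)):
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at time t.
            prev = max(states, key=lambda p: best[-1][p] + log_prob(trans[p].get(s, 0.0)))
            scores[s] = (best[-1][prev] + log_prob(trans[prev].get(s, 0.0))
                         + log_prob(emit[s].get(obs[t], 0.0)))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]
```

In a left-to-right model each state may only loop on itself or move to the next state, which is expressed here simply by leaving all other transitions out of the trans table.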

The data was labelled partially or fully by hand. To produce highly reliable data, the labels were cross-verified for all testing and training data. The templates for the DTW approach were manually selected from the beginning, middle and end of the recording. The frequency bins of the Fourier transform were used as the features for the DTW approach; bins with a frequency of less than 50 Hz were removed to eliminate low-frequency noise. This left frequency bins in the region of 50 Hz to 10 kHz. Experiments were used to determine the frame rate, FFT size and minimum element duration for each bird separately. (Douwe 9) For the HMM tests, HTK was used to create six types of parameter, including MFCC and LPCC, which are generally used in speech recognition. It was determined that MFCCs work far better with bird vocalisation than LPCCs, and they were therefore used together with their first and second derivatives. The window overlap and window size were determined experimentally on a per-species basis.
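The selection of frequency bins between 50 Hz and 10 kHz can be sketched in Python as follows. The function names are hypothetical, and the FFT size and sample rate used in the test are arbitrary example values, since the source states that these were chosen experimentally for each bird.

```python
def band_limited_bins(fft_size, sample_rate, low_hz=50.0, high_hz=10_000.0):
    """Indices of the FFT bins whose centre frequency lies in [low_hz, high_hz]."""
    bin_width = sample_rate / fft_size
    # Only bins up to the Nyquist frequency carry independent information.
    return [k for k in range(fft_size // 2 + 1)
            if low_hz <= k * bin_width <= high_hz]


def select_features(magnitudes, bins):
    """Keep only the spectral magnitudes of the selected bins."""
    return [magnitudes[k] for k in bins]
```

With wide bins the 50 Hz cut-off may remove only the DC bin, so how much low-frequency noise is rejected depends on the frequency resolution as well as on the cut-off itself.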

The results showed that both techniques worked equally well but were susceptible to background noise. Many more templates were needed in the DTW approach to deal with this noise, which made it much more sensitive to a noisy environment. By adding extra templates, particularly for the cage noises, the DTW was able to outperform the HMM, but this required a greater degree of work. Overall, both models were able to detect both species of bird and their syllables with a high degree of accuracy, with the HMM model proving more robust. Using a smaller number of states, 4 instead of 5, would improve the recognition of shorter syllables.

Somervuo et al.

Expanding on the previous work conducted by Kogan and Margoliash, Somervuo et al. used sinusoidal models, MFCCs and vectors of descriptive features, which had previously been used to represent tonal bird sounds and had produced good results (Douwe 11). Recognition was done using DTW, HMM and GMM. The bird vocalisations are segmented into syllables using the smoothed energy envelope of the signal, by selecting the regions above a syllable threshold. To compute the threshold, the background noise level is initialised to the global minimum of the envelope and the syllable threshold is set to half of that level. The estimate is then updated iteratively by locating the gaps between syllables: the energy in the gaps is used to set the background noise level, and the syllable threshold is again set to half that. The iteration is continued until the estimate is fairly stable.
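The iterative threshold estimate can be sketched in Python as below. Several details are assumptions rather than facts from the source: the envelope is taken to be in decibels and normalised so that its maximum is 0 dB (which makes "half the noise level" fall between the noise floor and the peak), the energy in the gaps is summarised by its mean, and the stopping tolerance is arbitrary. The function name segment_syllables is hypothetical.

```python
def segment_syllables(envelope_db, tol=0.1, max_iter=50):
    """Return (start, end) frame ranges above an iteratively refined threshold.

    envelope_db is assumed to be a smoothed energy envelope in dB with its
    maximum at 0 dB. The end index of each range is exclusive.
    """
    noise = min(envelope_db)        # initial background noise estimate
    threshold = noise / 2.0         # syllable threshold is half the noise level
    for _ in range(max_iter):
        gaps = [e for e in envelope_db if e < threshold]
        if not gaps:
            break
        noise = sum(gaps) / len(gaps)   # noise level from the energy in the gaps
        new_threshold = noise / 2.0
        converged = abs(new_threshold - threshold) < tol
        threshold = new_threshold
        if converged:
            break
    # Collect the runs of frames at or above the final threshold.
    segments, start = [], None
    for i, e in enumerate(envelope_db):
        if e >= threshold and start is None:
            start = i
        elif e < threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(envelope_db)))
    return segments, threshold
```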

Once the syllables are found, they are checked to see whether any can be grouped with neighbouring syllables that are less than 15 ms away. The automatic segmentation was found to detect around 93% of all syllables when compared with a manual segmentation of 50 songs, although this varied considerably between species.
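The grouping step can be illustrated with a short Python sketch. It assumes that each syllable is given as a (start, end) pair in seconds, and the function name merge_close_segments is hypothetical.

```python
def merge_close_segments(segments, max_gap_s=0.015):
    """Merge syllables whose separating gap is shorter than max_gap_s seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < max_gap_s:
            # Gap to the previous syllable is under 15 ms: extend that syllable.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```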

To produce the sinusoidal representation, the FFT of the signal was computed using a Hanning window of 256 samples with a 50% overlap between frames. The FFT was computed at a size of 1024, with zero padding. The audio was sampled at 44.1 kHz. The frequency component with maximum energy in each frame is then chosen.
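The framing and peak-picking can be sketched in Python as follows. To keep the sketch free of dependencies, a direct DFT sum stands in for an FFT routine, which is slower but gives the same spectrum. The function names are hypothetical, and picking only the single strongest bin is a simplification of a full sinusoidal model.

```python
import cmath
import math


def hann_window(size):
    """Symmetric Hann (Hanning) window."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (size - 1)) for n in range(size)]


def split_frames(signal, size=256, overlap=0.5):
    """Split a signal into frames of `size` samples with the given overlap."""
    hop = int(size * (1.0 - overlap))
    return [signal[s:s + size] for s in range(0, len(signal) - size + 1, hop)]


def peak_frequency(frame, sample_rate, fft_size=1024):
    """Frequency in Hz of the strongest component of one windowed frame."""
    window = hann_window(len(frame))
    windowed = [x * w for x, w in zip(frame, window)]
    best_bin, best_mag = 0, -1.0
    for k in range(fft_size // 2 + 1):
        # Zero padding to fft_size only adds zero terms, so the sum can
        # stop at the end of the frame.
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / fft_size)
                  for n, x in enumerate(windowed))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / fft_size
```

Zero padding from 256 to 1024 points does not add information, but it samples the spectrum four times more finely, so the located peak lies closer to the true frequency of the component.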

Alex L. McIlraith et al.

In this paper the authors attempt to produce a back-propagation neural network that is able to recognize bird songs (McIlraith). This was attempted using non-temporal ended vocalisations comprising 133 songs from six different bird species. For pre-processing, LPC and FFT were used to prepare the data, which was left-justified by hand using the software package Hypersignal plus (McIlraith).

A non-overlapping Hamming window of 256 samples was used for the framing. An LPC analysis of each frame produced 16 time-domain coefficients for a 15th-order LPC filter (McIlraith). The 16 LPC coefficients were used to construct an FFT with 9 unique spectral magnitudes. The procedure was later repeated using 1024-sample windows. Further work was carried out to determine the overall length of the actual vocalisation, as this was believed to be an important cue for identifying the species. Adding this additional variable gives the network a hint for determining the vocalisation (McIlraith 22). The time and spectral variables were normalised to a mean of zero and a standard deviation of one. The variables were then passed through a logistic function with a gain of unity.
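The normalisation to zero mean and unit standard deviation, and a logistic function with a gain of unity, can be written in Python as a minimal sketch. The use of the population standard deviation (dividing by N rather than N - 1) is an assumption, as the source does not say which was used, and the function names are hypothetical.

```python
import math


def standardise(values):
    """Rescale one variable to a mean of zero and a standard deviation of one."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


def logistic(x, gain=1.0):
    """Logistic function; a gain of 1.0 corresponds to a gain of unity."""
    return 1.0 / (1.0 + math.exp(-gain * x))
```

Standardising each variable separately keeps a variable with a large numeric range, such as song length, from dominating the spectral magnitudes at the network input.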

Three different data sets were created. Two had 10 variables: the spectral variables from a data window of either 256 or 1024 samples, plus the song length. The third consisted of records with the 1024-sample window size and all 19 variables included, which were repeated in the same way as the song length. The learning model chosen for the classifier was ‘vanilla’ back-propagation without momentum or higher-order derivative information. The back-propagation PDP algorithm (PDP 24) was used to accomplish the training. To accelerate the learning, the learning rate was set to 0.2 and the target values were changed from 0 and 1 to 0.2 and 0.8. Previous experiments had determined that a network with 12 hidden nodes, 10 to 19 inputs and 6 outputs would be ample for the procedure. The data was then divided into test and training sets, with around 25% allocated to the training set in all runs (McIlraith). Each time, the network was trained with new initial weights, with 10 training and 10 test sets generated for each of the three data sets. This enabled enhanced cross-validation, with all data generated in the same order to enable comparison. Preliminary tests showed that 1500 epochs were sufficient to reach a stable value of the mean sum-of-squares error (MSE). A program written in C read the final weights and computed the six output values for the test-set records, together with an error sum-of-squares (expected output vs. target). Any activation > 0.6 on any of the six outputs was tallied for the song. The winning class was the one with the highest tally. If two classes tied on the maximum value, the classification was considered to be incorrect.
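The tallying rule used to turn the network outputs into a single decision for a song can be sketched in Python. This is an illustration only, not the original C program. It assumes that each record of a song yields one list of six output activations, and it treats a song with no activation above the threshold as unclassified, which the source does not specify. The function name classify_song is hypothetical.

```python
def classify_song(record_outputs, threshold=0.6):
    """Return (winning class index or None, tally per class) for one song."""
    n_classes = len(record_outputs[0])
    tally = [0] * n_classes
    for outputs in record_outputs:
        for cls, activation in enumerate(outputs):
            if activation > threshold:
                tally[cls] += 1
    top = max(tally)
    winners = [cls for cls, count in enumerate(tally) if count == top]
    # A tie on the maximum, or no activation at all, gives no valid winner.
    if top == 0 or len(winners) > 1:
        return None, tally
    return winners[0], tally
```

Returning None for a tie lets the caller count the song as a misclassification, in line with the rule that tied maxima were considered incorrect.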

The classifier was able to correctly recognize around 80 to 85% of the samples assessed, with the data set that used 1024-sample windows outperforming the one that used 256-sample windows. For the training data, the mean MSE was smaller for the 1024-sample set and larger for the 256-sample set, with the combined data set giving the smallest value. These differences are significant over the duration of the training cycle. Two-tailed t-tests comparing the mean MSE at epoch 1500 gave p-values << 0.000001, which means that a difference this large would arise by chance less than one time in a million (McIlraith). It was found that some of the species had bimodal frequency distributions, which led to songs being consistently misidentified. This was often due to the data sets used. Another issue arose from using data obtained from the internet, which led to misclassification due to different dialects being present.