Analyzing Bird Audio
An analysis of bird audio from Xeno-canto to determine whether a recording is a song or a call.
Authors: Adithya Balaji, Malika Khurana
This report details our process for analyzing bird audio, with some snippets of code. You can find the full project repo on GitHub.
import IPython.display as ipd
import requests
Introduction
We aim to accurately classify bird sounds as songs or calls. We used three different approaches and models, based on recording metadata, the audio data itself, and spectrogram images of the recording, to perform this classification task.
Motivation
The primary motivation to address this problem is to make it easier for scientists to collect data on bird populations and verify community-sourced labels.
The other motivation is more open-ended: to understand the "hidden" insights in bird sounds. Bird calls reveal regional dialects, a sense of humor, information about predators in the area, indicators of ecosystem health, and inevitably also the threats human activity poses to their ecosystems. By exploring bird audio data, we hope to build towards a better understanding of the impacts of the sounds produced by humans and to become better listeners.
Songs vs Calls
Bird sounds vary along many dimensions, but one of the first levels of categorizing them is classifying each as a song or a call, as each has distinct functions and reveals different aspects of the birds' ecology (1, 2).
ipd.Audio(
requests.get(
"https://github.com/adithyabsk/bird_audio/blob/main/notebooks/assets/574080.mp3?raw=true"
).content
)
Calls
Calls are shorter than songs and perform a wider range of functions, like signalling food, maintaining social cohesion and contact, coordinating flight, resolving conflicts, and sounding alarms (distress, mobbing, hawk alarms) (6). Bird alarm calls can be understood and passed along across species, and have been found to encode information about the size and threat of a potential predator so that birds can respond accordingly, e.g. more intense mobbing for a higher threat (7, 8). Alarm calls can also give scientists an estimate of the number of predators in an area.
ipd.Audio(
    requests.get(
        "https://github.com/adithyabsk/bird_audio/blob/main/notebooks/assets/585148.mp3?raw=true"
    ).content
)
Related Work
Allometry of Alarm Calls: Black-Capped Chickadees Encode Information About Predator Size (8). The number of D-notes in chickadee mobbing alarm calls varies inversely with predator size.
Gender identification using acoustic analysis in birds without external sexual dimorphism (9). Bird sounds were analyzed to classify gender. Important acoustic features were: fundamental frequency (mean, max, count), note duration, syllable count and spacing, and amplitude modulation.
Regional dialects have been discovered among many bird species, and the Yellowhammer is a great example (10, 11). Yellowhammer sounds in the Czech Republic and the UK were studied to identify regional dialects, which differed in the frequency and length of final syllables.
DVC
Data Version Control (DVC) is a useful tool for data science projects; you can think of it like git, but for data. We built out our pipeline first in Jupyter notebooks, and then in DVC, making it easy to change parameters and run the full pipeline from one place.
Collecting Data
For our analysis, we used audio files and metadata from xeno-canto.org. Xeno-canto (XC) is a website for collecting and sharing audio recordings of birds. Recordings and identifications on XC are sourced from the community (anyone can join).
XC has a straightforward API that allows us to make RESTful queries and specify a number of filter parameters, including country, species, recording quality, and duration. We used the XC API to get metadata and IDs for all recordings in the United States, and saved the JSON payload as a dataframe and CSV. Below is the core of the DVC step that parallelizes metadata collection from XC.
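The following is a minimal sketch of that step rather than the project's exact script; it assumes the public XC API v2 endpoint and its paginated response schema ("numPages", "recordings"), and the worker count and output filename are illustrative.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

API_URL = "https://xeno-canto.org/api/2/recordings"

def fetch_page(page):
    """Fetch one page of recording metadata for the United States."""
    resp = requests.get(API_URL, params={"query": 'cnt:"United States"', "page": page})
    resp.raise_for_status()
    return resp.json()

# The first page reports the total page count for the query.
first = fetch_page(1)
num_pages = int(first["numPages"])

# Fetch the remaining pages in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = [first, *pool.map(fetch_page, range(2, num_pages + 1))]

# Flatten all pages into one dataframe and persist it as a CSV.
records = [rec for page in pages for rec in page["recordings"]]
pd.DataFrame(records).to_csv("xc_us_metadata.csv", index=False)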
Filtering & Labeling
Through our DVC pipeline, we further filtered to the top 220 unique species, recordings under 20 seconds, recording quality A or B, and recordings with spectrograms available on XC. This reduced our dataset from ~60,000 recordings to a dataframe of 5,800. We created labels (1 for call, 0 for song) by parsing the 'type' column of the dataframe.
Scripts in the project repo handle that process.
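In outline, the filtering and labeling logic looks like the sketch below (a simplification: column names follow the XC metadata fields, and the duration parsing assumes the XC 'length' field is formatted "m:ss").

import pandas as pd

df = pd.read_csv("xc_us_metadata.csv")  # output of the collection step above

# Keep quality A/B recordings of the 220 most common species.
top_species = df["en"].value_counts().head(220).index
df = df[df["en"].isin(top_species) & df["q"].isin(["A", "B"])]

# Keep recordings under 20 seconds (XC 'length' is a "m:ss" string).
secs = df["length"].str.split(":").map(lambda p: int(p[0]) * 60 + int(p[1]))
df = df[secs < 20]

# Parse the free-text 'type' column into a binary label (1 = call, 0 = song),
# dropping ambiguous rows that mention both terms or neither.
kind = df["type"].str.lower()
is_call = kind.str.contains("call", na=False)
is_song = kind.str.contains("song", na=False)
mask = is_call ^ is_song
df = df[mask].copy()
df["label"] = is_call[mask].astype(int)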
Exploring & Visualizing Data
With our dataset assembled, we began exploring it visually. A distribution of recordings by genus, with song-call splits, shows that the most represented genus in the dataset is the warblers (Setophaga), with many more song than call recordings. We can also see that, as expected, woodpeckers (Melanerpes) and jays, magpies, and crows (Cyanocitta, Corvus) have almost no song recordings in the dataset.
A map of recording density shows the regions most represented in the dataset which are, unsurprisingly, bird watching hot spots.
Given our domain knowledge that songs serve an important function in mating, we expected to see a higher proportion of songs in the spring, which is confirmed by the data.
Metadata Classification Model
In our first model, we used the tabular metadata from XC entries to train a Gradient Boosted Decision Tree (GBDT) model using XGBoost. XGBoost is a popular GBDT implementation, with a Python interface, that is designed to work on large amounts of data.
We used the genus, species, English name, and location (latitude and longitude) from the XC metadata. The categorical features were imputed and one-hot encoded using sklearn transformers, while latitude and longitude were scaled (standard or min-max scaling) and the time features were transformed with a sine function. We can see 10 rows of unprocessed data in the HTML table below.
The data transformation pipeline and model training code live in a Jupyter notebook in the repo.
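A condensed sketch of that pipeline is shown here; column names follow the XC metadata fields, while the scalers, hyperparameters, and the train_df variable are illustrative rather than the notebook's exact choices.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder
from xgboost import XGBClassifier

categorical = ["gen", "sp", "en"]  # genus, species, English name
coords = ["lat", "lng"]            # latitude / longitude
cyclical = ["month"]               # derived from the XC 'date' field

preprocess = ColumnTransformer([
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")), categorical),
    ("geo", make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler()), coords),
    # Map month onto a sine wave so December and January land close together.
    ("time", FunctionTransformer(
        lambda m: np.sin(2 * np.pi * np.asarray(m, dtype=float) / 12)), cyclical),
])

metadata_model = make_pipeline(preprocess,
                               XGBClassifier(n_estimators=300, eval_metric="logloss"))
metadata_model.fit(train_df[categorical + coords + cyclical], train_df["label"])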
Building Audio Features
We ran the audio data through a high-pass Butterworth filter to remove background noise. We tested different parameters for Butterworth and firwin filters, then examined the resulting spectrograms and audio to determine which best reduced background noise without clipping bird-sound frequencies.
The code snippet below shows the process of loading the .mp3 file and performing the above filtering steps before saving the result as a pd.DataFrame, which is what ts-fresh expects.
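This is a minimal sketch; the 1 kHz cutoff and 5th-order filter are illustrative values, not necessarily the ones chosen after the parameter comparison above.

import librosa
import pandas as pd
from scipy import signal

def load_and_filter(path, rec_id, cutoff_hz=1000):
    """Load an mp3, high-pass filter it, and shape it for ts-fresh."""
    y, sr = librosa.load(path, sr=None)  # keep the native sampling rate
    # 5th-order Butterworth high-pass to suppress low-frequency background noise.
    sos = signal.butter(5, cutoff_hz, btype="highpass", fs=sr, output="sos")
    y_filtered = signal.sosfilt(sos, y)
    # ts-fresh expects long-format rows of (id, time, value).
    return pd.DataFrame({
        "id": rec_id,
        "time": range(len(y_filtered)),
        "value": y_filtered,
    })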
Feature Selection & Extraction
We used ts-fresh to featurize each audio array after unpacking and filtering it, processing recordings one at a time to avoid running out of memory. ts-fresh takes in dataframes with an id column, time column, and value column.
ts-fresh provides feature calculator presets, but because they (together with librosa.load) took 13+ hours to run on just 5% of the dataset, we manually specified a small set of features based on our domain understanding of bird audio analysis.
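For illustration, a hand-picked calculator set might look like the following; this is an assumed subset, not the project's exact list, and audio_df stands for the concatenated output of load_and_filter over all recordings.

from tsfresh import extract_features

# Hand-picked ts-fresh feature calculators (illustrative subset).
fc_parameters = {
    "abs_energy": None,
    "mean": None,
    "standard_deviation": None,
    "number_peaks": [{"n": 10}],
    "fft_aggregated": [{"aggtype": "centroid"}, {"aggtype": "variance"}],
}

features = extract_features(
    audio_df,                 # long-format (id, time, value) rows from above
    column_id="id",
    column_sort="time",
    column_value="value",
    default_fc_parameters=fc_parameters,
)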
Lastly, we passed this "static" time series feature dataframe into a similar XGBoost model (from above) to predict the output class.
Spectrogram Classification Model
We used a computer vision approach to analyze spectrograms with a fast.ai pre-trained model: an xresnet18 architecture pre-trained on ImageNet. We load the data using fast.ai's ImageDataLoaders. The model is then cut at the pooling layer (with frozen weights), and its last layers are trained, utilizing transfer learning on our spectrogram images. A diagram of the architecture, pulled directly from the original ResNet paper, is included below.
The model itself was trained on a Tesla K80 GPU using Google Colab to speed up the training process. Additionally, we used Weights and Biases to track training and improve model tuning. The main steps of the training code are sketched below.
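This is a condensed sketch rather than the project's exact notebook; the dataframe and column names (spec_df, filename, label), the W&B project name, and the loader arguments are hypothetical.

import wandb
from fastai.callback.wandb import WandbCallback
from fastai.vision.all import *

wandb.init(project="bird_audio")  # hypothetical project name

# One row per spectrogram image and its song/call label.
dls = ImageDataLoaders.from_df(
    spec_df, path="spectrograms", fn_col="filename", label_col="label",
    valid_pct=0.2, item_tfms=Resize(224),
)

# Transfer learning: freeze the pretrained body, train the new head.
learn = vision_learner(dls, xresnet18, metrics=accuracy, cbs=WandbCallback())
learn.fine_tune(1)  # a single epoch, as in the report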
Results
Across our three models, we achieved accuracy scores in the range of 64-77%. This is above the baseline score of 55% (the mean of the labels), and we believe that with more time to tune and ensemble the models, one could achieve an even more accurate classifier. We are encouraged by the amount of room the time-series-based and spectrogram-based models have for improvement, given that the metadata model outperforms them in accuracy by a wide margin.
Plots
Metadata Model
We note a plateau in the XGBoost validation accuracy, which suggests that further gains could be had from tuning early stopping.
Additionally, due to the nature of decision-tree-based models, we are able to compute feature importances. The most important features include the genera: this is not so surprising when we recall our genus-count distribution and see that the genera here are mostly those whose recordings are almost entirely songs or calls. The other important feature is month; again, we recall that in the spring the ratio of songs to calls goes up, so time of year is a "good" feature.
Time Series Model
We can see that the test loss increases due to overfitting, which is also evidenced by the very high training accuracy. This is a potential area of improvement for further research.
Spectrogram Model
This is the direct output from WandB, which depicts the training process for the fine-tuned xresnet model. It is important to note that the x-axis shows steps rather than epochs, as this model was trained for only a single epoch (to save time and memory).
Future Work
We would like to note a couple of immediate next steps the project could take to dramatically improve model performance:
- Ensembling the 3 models using a VotingClassifier (see the sketch after this list)
- More training time for the Spectrogram model (only 30 minutes was provided for fine-tuning)
- Additional epochs (only 1 epoch was provided)
- Filtering features in the audio classification model (ts-fresh likely generates more features than are needed)
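For the first item, a hypothetical soft-voting sketch over the two XGBoost pipelines is shown below; it assumes each pipeline selects its own columns from a shared feature dataframe, and the fast.ai spectrogram model would need an sklearn-compatible wrapper before it could join the vote.

from sklearn.ensemble import VotingClassifier

# metadata_model and ts_model are the fitted-pipeline names used in the
# sketches above; X_train/y_train are placeholders for the shared features.
ensemble = VotingClassifier(
    estimators=[("metadata", metadata_model), ("timeseries", ts_model)],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X_train, y_train)
preds = ensemble.predict(X_test)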
- Long term: integrate the model with Xeno-canto to provide tag suggestions based on the audio clip
Conclusion
The classification of song vs call is the first distinction one can make in bird audio data across species, and on its own can give insights into the number of predators in an ecosystem, the timing of mating season, and other behaviors. It could also be valuable as part of a larger system of models. This report presents a promising start to tackling this problem with three separate machine learning models of reasonable accuracy. These models will likely prove quite handy in downstream classification tasks that look to find species, gender, location, and other parameters from a bird audio sample.
References
- "A Beginner’s Guide to Common Bird Sounds and What They Mean." Audubon.org.
- "Two Types of Communication Between Birds: Understanding Bird Language Songs And Calls." Youtube.
- "Bird Vocalization." Wikipedia.
- Gorissen, Leen, et al. “Heavy Metal Pollution Affects Dawn Singing Behaviour in a Small Passerine Bird.” Oecologia, vol. 145, no. 3, 2005, pp. 504–509. JSTOR
- Ortega, Yvette K.; Benson, Aubree; Greene, Erick. 2014. Invasive plant erodes local song diversity in a migratory passerine. Ecology. 95(2): 458-465. Ecological Society of America
- Marler, P. (2004), Bird Calls: Their Potential for Behavioral Neurobiology. Annals of the New York Academy of Sciences, 1016: 31-44. https://doi.org/10.1196/annals.1298.034
- "These birds 'retweet' alarm calls—but are careful about spreading rumors." National Geographic.
- Templeton, Christopher N., et al. “Allometry of Alarm Calls: Black-Capped Chickadees Encode Information About Predator Size.” Science, vol. 308, no. 5730, American Association for the Advancement of Science, 2005, pp. 1934–37, doi:10.1126/science.1108841.
- Volodin, I.A., Volodina, E.V., Klenova, A.V. et al. Gender identification using acoustic analysis in birds without external sexual dimorphism. Avian Res 6, 20 (2015). https://doi.org/10.1186/s40657-015-0033-y
- "About yellowhammers." Yellowhammer Dialects.
- Harry R Harding, Timothy A C Gordon, Emma Eastcott, Stephen D Simpson, Andrew N Radford, Causes and consequences of intraspecific variation in animal responses to anthropogenic noise, Behavioral Ecology, Volume 30, Issue 6, November/December 2019, Pages 1501–1511, https://doi.org/10.1093/beheco/arz114
- "Open-source Version Control System for Machine Learning Projects." DVC.
- xeno-canto.
- scikit-learn.
- xgboost.
- fast.ai.
Metrics
Word Count
1753 words
Code Line Count
We used CLOC to generate the code line counts.
| Language         | Files | Code |
|------------------|-------|------|
| Jupyter Notebook | 9     | 1195 |
| Python           | 8     | 397  |
| Sum              | 17    | 1592 |