Classifying Environmental Audio Recordings

David Morcuende
3 min read · Jul 1, 2020


Building a machine learning model to classify environmental audio recordings into 50 different categories.

Dataset

The dataset used is the ESC-50 dataset.

The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

If we load the metadata, we can see that there are 7 columns, but we only need the audio filenames and the category.

So we drop the unnecessary columns and create a new one with the encoded categories.
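
A minimal sketch of this step with pandas, assuming the standard ESC-50 repository layout (the CSV path and column names come from the dataset's esc50.csv metadata file):

```
import pandas as pd

# Load the ESC-50 metadata (path assumes the default repo layout).
df = pd.read_csv("ESC-50-master/meta/esc50.csv")

# Keep only the audio filename and the category label.
df = df[["filename", "category"]]

# Add a new column with integer-encoded categories.
df["label"] = df["category"].astype("category").cat.codes

print(df["category"].nunique())        # number of categories
print(df["category"].value_counts())   # clips per category
```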

We can see that there are 50 different categories and that every category is perfectly balanced (40 clips each).

Transforming sound to image

To transform sound into an image, we are going to use the librosa library.

First, we plot the audio as a waveplot (amplitude over time).

This example comes from a clock-tick recording; we can listen to it with IPython's display utilities.
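
A sketch of loading a clip, drawing the waveplot, and playing it back in a notebook. The file path here is hypothetical; note also that librosa.display.waveplot was renamed to waveshow in librosa ≥ 0.10:

```
import librosa
import librosa.display
import matplotlib.pyplot as plt
from IPython.display import Audio

# Hypothetical path; substitute any ESC-50 clip.
path = "ESC-50-master/audio/clock_tick_example.wav"

y, sr = librosa.load(path)  # loads mono audio at 22050 Hz by default

# Draw the waveform (use librosa.display.waveplot on older librosa versions).
librosa.display.waveshow(y, sr=sr)
plt.show()

# Play the clip inside a Jupyter notebook.
Audio(y, rate=sr)
```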

Once we have the waveform, we compute a spectrogram from it, again with librosa.

And then we normalize the spectrogram.
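
A sketch of both steps. The post doesn't say which spectrogram type it uses, so a mel spectrogram with decibel scaling (a common way to normalize it) is an assumption here:

```
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# y, sr come from librosa.load, as in the previous snippet.
y, sr = librosa.load("ESC-50-master/audio/clock_tick_example.wav")  # hypothetical path

# Compute a mel spectrogram from the waveform.
S = librosa.feature.melspectrogram(y=y, sr=sr)

# Normalize by converting power values to decibels relative to the peak.
S_db = librosa.power_to_db(S, ref=np.max)

librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.show()
```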

Now that we can turn audio into an image, we write a function to process all the audio clips and save the resulting images using the directory structure that torchvision's ImageFolder loader expects.

For example:
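
The layout ImageFolder expects is one subdirectory per class; the file names below are purely illustrative:

```
spectrograms/
├── clock_tick/
│   ├── clip_001.png
│   └── clip_002.png
├── dog/
│   ├── clip_003.png
│   └── ...
└── ...
```

And a sketch of the processing function, under the same assumptions as the snippets above (mel spectrograms saved as PNGs, paths hypothetical):

```
import os
from pathlib import Path

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def audios_to_images(df, audio_dir="ESC-50-master/audio", out_dir="spectrograms"):
    """Save every clip's spectrogram as a PNG under out_dir/<category>/."""
    for _, row in df.iterrows():
        y, sr = librosa.load(os.path.join(audio_dir, row["filename"]))
        S_db = librosa.power_to_db(
            librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max
        )

        target_dir = Path(out_dir) / row["category"]
        target_dir.mkdir(parents=True, exist_ok=True)

        plt.figure(figsize=(2, 2))
        librosa.display.specshow(S_db, sr=sr)
        plt.axis("off")
        plt.savefig(target_dir / row["filename"].replace(".wav", ".png"),
                    bbox_inches="tight", pad_inches=0)
        plt.close()
```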

Model

We are using PyTorch to build and train the model, so we can use the ImageFolder dataset from torchvision to load all the images. But first, we resize each one to 32 × 32 and convert them to tensors.
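
A sketch of loading the images (the directory name follows the layout above):

```
import torchvision.transforms as T
from torchvision.datasets import ImageFolder

transform = T.Compose([
    T.Resize((32, 32)),  # downsize every spectrogram image to 32 x 32
    T.ToTensor(),        # convert each PIL image to a [3, 32, 32] float tensor
])

dataset = ImageFolder("spectrograms", transform=transform)
```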

One example of the tensor and the image could be:
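
For instance, inspecting the first sample (the shapes are what the transforms above produce):

```
import matplotlib.pyplot as plt

img, label = dataset[0]
print(img.shape)               # torch.Size([3, 32, 32])
print(dataset.classes[label])  # the human-readable category name

# Show the image; permute from [C, H, W] to [H, W, C] for matplotlib.
plt.imshow(img.permute(1, 2, 0))
plt.show()
```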

Then we split the dataset into three parts: train, validation, and test (roughly 70%, 20%, and 10% respectively).
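
A sketch with torch.utils.data.random_split (the seed value is an arbitrary assumption for reproducibility):

```
import torch
from torch.utils.data import random_split

torch.manual_seed(42)  # arbitrary seed, assumed for reproducibility

n = len(dataset)                            # 2000 clips in ESC-50
n_val, n_test = int(0.2 * n), int(0.1 * n)  # 400 and 200 clips
n_train = n - n_val - n_test                # the remaining 1400 clips

train_ds, val_ds, test_ds = random_split(dataset, [n_train, n_val, n_test])
```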

And finally, we create a dataloader from each of those datasets.

We are using a batch size of 32. To see one batch from the dataloader we can use this function:
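
A sketch of the dataloaders and a show_batch helper built on torchvision's make_grid (the variable names and worker counts are assumptions):

```
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torchvision.utils import make_grid

batch_size = 32
train_dl = DataLoader(train_ds, batch_size, shuffle=True, num_workers=2)
val_dl = DataLoader(val_ds, batch_size * 2, num_workers=2)

def show_batch(dl):
    """Display one batch of images from a dataloader as a grid."""
    for images, _ in dl:
        fig, ax = plt.subplots(figsize=(12, 6))
        ax.set_xticks([])
        ax.set_yticks([])
        ax.imshow(make_grid(images, nrow=8).permute(1, 2, 0))
        break

show_batch(train_dl)
```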

Then we define some metrics and validation logic in an ImageClassificationBase class, which has methods to compute the training and validation losses.
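
A sketch of such a base class, following the common PyTorch tutorial pattern (the exact method names are assumptions):

```
import torch
import torch.nn as nn
import torch.nn.functional as F

def accuracy(outputs, labels):
    """Fraction of predictions that match the labels."""
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

class ImageClassificationBase(nn.Module):
    def training_step(self, batch):
        images, labels = batch
        out = self(images)                   # forward pass
        return F.cross_entropy(out, labels)  # training loss

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        return {"val_loss": F.cross_entropy(out, labels).detach(),
                "val_acc": accuracy(out, labels)}

    def validation_epoch_end(self, outputs):
        losses = [x["val_loss"] for x in outputs]
        accs = [x["val_acc"] for x in outputs]
        return {"val_loss": torch.stack(losses).mean().item(),
                "val_acc": torch.stack(accs).mean().item()}
```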

Finally, we build and train the model; in this case it is similar to a typical CIFAR-10 CNN.

The model:
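
A sketch of a CIFAR-10-style CNN adapted to 50 output classes; the exact layer sizes are assumptions:

```
class ESC50Model(ImageClassificationBase):
    """A small CIFAR-10-style CNN; the exact architecture is an assumption."""
    def __init__(self, num_classes=50):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),   # 32x32 -> 16x16

            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),   # 16x16 -> 8x8

            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),   # 8x8 -> 4x4

            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, xb):
        return self.network(xb)
```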

We define some auxiliary functions to move the dataloaders and the model onto a GPU accelerator when one is available.
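
A sketch of the usual helpers for this (a common pattern in PyTorch tutorials):

```
import torch

def get_default_device():
    """Pick the GPU if available, else the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

def to_device(data, device):
    """Move a tensor (or a list/tuple of tensors) to the chosen device."""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

class DeviceDataLoader:
    """Wrap a dataloader to move each batch to the device on the fly."""
    def __init__(self, dl, device):
        self.dl, self.device = dl, device

    def __iter__(self):
        for batch in self.dl:
            yield to_device(batch, self.device)

    def __len__(self):
        return len(self.dl)
```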

For training, in this case, we run 104 epochs with a learning rate of 0.001 and another 25 epochs with a learning rate of 0.0001, both with the Adam optimizer.
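
A sketch of the training loop and the two runs; the fit/evaluate structure is an assumption in the style of the base class above:

```
import torch

@torch.no_grad()
def evaluate(model, val_loader):
    """Run the model over the validation set and aggregate the metrics."""
    model.eval()
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.Adam):
    """Train for the given number of epochs, returning per-epoch metrics."""
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        history.append(evaluate(model, val_loader))
    return history

device = get_default_device()
model = to_device(ESC50Model(), device)
train_dl = DeviceDataLoader(train_dl, device)
val_dl = DeviceDataLoader(val_dl, device)

history = fit(104, 0.001, model, train_dl, val_dl)
history += fit(25, 0.0001, model, train_dl, val_dl)
```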

The losses over training look like this:

And the accuracy:
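
Both plots can be produced from the recorded history; a minimal sketch, assuming fit returns one metrics dict per epoch as above:

```
import matplotlib.pyplot as plt

def plot_losses(history):
    plt.plot([x["val_loss"] for x in history], "-x")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.title("Validation loss vs. epochs")
    plt.show()

def plot_accuracies(history):
    plt.plot([x["val_acc"] for x in history], "-x")
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.title("Validation accuracy vs. epochs")
    plt.show()

plot_losses(history)
plot_accuracies(history)
```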

The model is clearly overfitting and the results are not very good, so we will have to try different models.

It doesn't perform well on the test set either.
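
A sketch of the test-set evaluation, reusing the evaluate helper from above:

```
from torch.utils.data import DataLoader

test_dl = DeviceDataLoader(DataLoader(test_ds, batch_size * 2), device)
result = evaluate(model, test_dl)
print(result)  # {'val_loss': ..., 'val_acc': ...}
```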

You can find all the code at https://github.com/Morcu/ESC-50-audio-clasification
