Researchers at Google’s AI division have come up with a reworked approach to audio classification, aiming for improved performance and better adaptability across a range of application domains.
In a paper to be presented at this year’s ninth International Conference on Learning Representations (ICLR) in May, the researchers introduce their Learnable Audio Frontend, LEAF, which replaces the various fixed operations of traditional pipelines with learned ones.
The work largely circles around the fact that, unlike other fields of machine learning that can work on raw data, “deep neural networks for audio classification are rarely trained from raw audio waveforms”. This is mainly because ML systems are often designed to mimic the human signal processing apparatus. In audio-related scenarios, data therefore has to be preprocessed to match the way humans perceive frequencies, giving more weight to the lower end of the spectrum.
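The standard way to encode that perceptual bias is the mel scale, which maps frequency in Hz onto a scale where equal steps sound equally spaced to a listener. A quick illustration using the common O’Shaughnessy formula (one of several conventions in use):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# The mel scale compresses high frequencies: equal mel steps cover
# ever-wider Hz ranges as frequency rises, so low frequencies get
# proportionally more resolution.
print(hz_to_mel(100.0))    # low frequencies map almost linearly
print(hz_to_mel(1000.0))   # roughly 1000 mel by construction
print(hz_to_mel(8000.0))   # far less than 8x the value at 1 kHz
```

An octave jump from 4 kHz to 8 kHz covers far fewer mel than the same jump at the bottom of the range, which is exactly the kind of built-in assumption that may not suit non-human sounds.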
This, however, might not always be necessary or even helpful, the researchers point out, using the example of recognising whale calls. They therefore propose a more tailored approach. LEAF is guided by the steps used when creating the traditionally used mel filterbanks: windowing a signal to capture a sound’s time variability, filtering, and compressing. Instead of fixed layers designed to approximate how a human would perceive a signal, LEAF learns the operation best suited to the use case at hand (think a learned frequency scale for pitch instead of the fixed mel scale).
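The three fixed steps LEAF sets out to replace can be sketched in plain NumPy — a simplified illustration of a classic log-mel pipeline, not the exact implementation used in the paper:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):
            fbank[i, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i, k] = (right - k) / max(right - centre, 1)
    return fbank

def log_mel_spectrogram(signal, sample_rate=16000, n_fft=512, hop=160, n_filters=40):
    # 1. Windowing: slice the waveform into overlapping Hann-windowed frames.
    window = np.hanning(n_fft)
    frames = [signal[s:s + n_fft] * window
              for s in range(0, len(signal) - n_fft + 1, hop)]
    # 2. Filtering: FFT magnitudes projected onto the fixed mel filters.
    mag = np.abs(np.fft.rfft(np.array(frames), n=n_fft))
    mel = mag @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # 3. Compression: log squashes the dynamic range, mimicking loudness perception.
    return np.log(mel + 1e-6)

# One second of a 440 Hz tone as a smoke test.
t = np.arange(16000) / 16000.0
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(feats.shape)   # (frames, 40)
```

LEAF’s contribution is to make each of these three commented stages learnable rather than hand-fixed, so the frequency scale, pooling, and compression are fitted to the task during training.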
Since systems based on mel filterbanks are also said to be less than ideal when working with noisy data, switching to LEAF could help if that’s all there is available. This weakness alone has already spurred plenty of research into learnable alternatives, though most options currently available seem to fall short on performance.
This is mostly down to the fact that a trainable frontend adds parameters that all have to be optimised to get good results, and the more of them there are, the harder that becomes. LEAF works around this by using Gabor convolution layers, which need just two parameters per filter.
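To see why this keeps the parameter count so low, here is a minimal sketch of a Gabor filter — a sinusoid under a Gaussian envelope. The function and parameter names are ours for illustration, not taken from the paper:

```python
import numpy as np

def gabor_filter(centre_freq, bandwidth, length=401, sample_rate=16000.0):
    """Complex Gabor filter: a sinusoid modulated by a Gaussian envelope.

    Only two parameters define each filter: the centre frequency of the
    sinusoid (Hz) and the width of the Gaussian window (seconds) -- these
    are the two values a LEAF-style frontend would learn per filter.
    """
    t = (np.arange(length) - length // 2) / sample_rate
    envelope = np.exp(-0.5 * (t / bandwidth) ** 2)   # Gaussian window
    carrier = np.exp(2j * np.pi * centre_freq * t)   # complex sinusoid
    return envelope * carrier

# A bank of N such filters needs just 2 * N trainable values,
# versus length * N for an unconstrained convolution layer.
bank = np.stack([gabor_filter(f, 0.001) for f in (250.0, 1000.0, 4000.0)])
print(bank.shape)   # (3, 401)
```

With 40 filters that is 80 trainable values instead of tens of thousands for a free-form convolution of the same length, which is what makes the frontend cheap to optimise.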
Google’s first test results show LEAF outperforming mel filterbanks in average accuracy across a variety of tasks. Devs interested in the approach can find a TensorFlow implementation on GitHub to verify the team’s findings. To make this easier, LEAF is designed as a drop-in replacement for mel filterbanks, since “any model that can be trained using mel filterbanks as input features, can also be trained on LEAF spectrograms.”
Going forward, the team plans to get rid of the convolutional architecture, with its fixed filter lengths and strides, in favour of a system that can learn these elements as well, removing further bias. It also expects its general principle of learning to filter, pool and compress to benefit the analysis of seismic data and physiological recordings, so forays into the realm of non-audio signals could be next for at least some of the researchers.