SSELDnet - A Fully End-to-End Sample-Level Framework for Sound Event Localization and Detection

Abstract

Sound event localization and detection (SELD) is a multi-task learning problem that aims to detect different audio events and estimate their corresponding locations. All previously proposed SELD systems rely on hand-crafted features such as Mel-spectrograms to make predictions, which requires specific prior knowledge of acoustics. In this report, we investigate the possibility of applying representation learning directly to raw audio and propose an end-to-end sample-level SELD framework. To improve generalization, we apply three data augmentation techniques: sound field rotation, time masking, and random audio equalization. The proposed system is evaluated on the TAU-NIGENS Spatial Sound Events 2021 development dataset. The experimental results will be submitted to the DCASE 2021 Challenge Task 3.
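
As an illustration of one of the augmentations named in the abstract, time masking on a raw multichannel waveform might look roughly like the sketch below. This is not the implementation from the report: the function name `time_mask`, the `max_mask_len` parameter, the single-span masking policy, and the 4-channel, 24 kHz example input are all assumptions made for illustration.

```python
import numpy as np


def time_mask(audio, max_mask_len, rng):
    """Zero out one random contiguous span of samples in every channel.

    audio:        array of shape (channels, samples), raw waveform
    max_mask_len: hypothetical upper bound on the masked span, in samples
    rng:          a numpy.random.Generator
    Returns the masked copy plus the span start and length.
    """
    n_samples = audio.shape[-1]
    # Draw a span length in [0, max_mask_len], clipped to the clip length.
    mask_len = min(int(rng.integers(0, max_mask_len + 1)), n_samples)
    # Draw a start index so the span fits entirely inside the clip.
    start = int(rng.integers(0, n_samples - mask_len + 1))
    out = audio.copy()
    out[..., start:start + mask_len] = 0.0
    return out, start, mask_len


# Example: a random stand-in for 1 s of 4-channel audio (assumed 24 kHz).
rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 24000))
masked, start, mask_len = time_mask(audio, max_mask_len=2400, rng=rng)
```

Because the same span is silenced in all channels, the spatial cues in the unmasked regions are left untouched, so the localization labels do not need to change under this augmentation.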

Publication
In Detection and Classification of Acoustic Scenes and Events 2021
Daolang Huang
Doctoral candidate at Aalto University

My research interests include amortized inference, deep learning, probabilistic machine learning, and audio information retrieval.