BAST-Mamba: Binaural Audio Spectrogram Mamba Transformer for binaural sound localization

  • Sheng Kuang
  • Jie Shi
  • Kiki van der Heijden
  • Siamak Mehrkanoon*

  *Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Accurate sound localization in reverberant environments is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been used to model the binaural human auditory pathway. However, CNNs face limitations in capturing global acoustic features. To address this issue, we propose a novel end-to-end Binaural Audio Spectrogram Mamba Transformer (BAST-Mamba) model to predict sound azimuth in both anechoic and reverberant conditions. We explore two implementation modes, BAST-Mamba-SP and BAST-Mamba-NSP, which correspond to shared and non-shared parameter configurations, respectively. Our best model, BAST-Mamba-SP, equipped with subtraction-based interaural integration and a hybrid loss function, achieves a state-of-the-art angular distance (AD) error of 0.89° and a mean squared error of 0.0004, significantly outperforming baseline models. The model demonstrates generalization across acoustic environments, robust hemifield symmetry, and highly accurate real-time localization performance (<4° AD at 300 ms). Moderate noise augmentation at 30 dB SNR yields the strongest noise resilience. Explainability analyses highlight a consistent frequency focus in the 2–3 kHz and 5.5–6.5 kHz bands, aligning with known neurophysiological cues. These results validate the potential of neurobiologically inspired Transformers for robust, high-precision sound localization and offer new insights into human sound localization.
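
To make the two quantities in the abstract more concrete, the following is a minimal Python sketch of how an angular distance (AD) error and noise augmentation at a target SNR might be computed. It is illustrative only and is not taken from the article: the function names (`azimuth_to_xy`, `angular_distance_deg`, `add_noise_at_snr`), the representation of azimuth as a point on the unit circle, and the power-based SNR definition are all assumptions made for this example.

```python
import math


def azimuth_to_xy(azimuth_deg):
    """Map an azimuth in degrees to a point on the unit circle (assumed encoding)."""
    a = math.radians(azimuth_deg)
    return (math.cos(a), math.sin(a))


def angular_distance_deg(pred_xy, true_xy):
    """Angle in degrees between a predicted and a true 2-D direction vector."""
    px, py = pred_xy
    tx, ty = true_xy
    norm = math.hypot(px, py) * math.hypot(tx, ty)
    if norm == 0.0:
        raise ValueError("direction vectors must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cos_angle = max(-1.0, min(1.0, (px * tx + py * ty) / norm))
    return math.degrees(math.acos(cos_angle))


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / noise power equals `snr_db`, then mix."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    p_signal = mean_power(signal)
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise must have non-zero power")
    # SNR_dB = 10 * log10(p_signal / (scale^2 * p_noise))  =>  solve for scale.
    scale = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(signal, noise)]


if __name__ == "__main__":
    # AD between a hypothetical prediction at 11 degrees and a target at 10 degrees.
    print(angular_distance_deg(azimuth_to_xy(11.0), azimuth_to_xy(10.0)))

    # Mix a synthetic tone with a synthetic interferer at 30 dB SNR.
    tone = [math.sin(0.1 * i) for i in range(1000)]
    interferer = [math.cos(1.7 * i) for i in range(1000)]
    noisy = add_noise_at_snr(tone, interferer, 30.0)
    print(len(noisy))
```

Measuring the angle between direction vectors, rather than subtracting azimuth values directly, avoids wrap-around problems near 0°/360°; whether the article computes AD in exactly this way is not stated in the abstract.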

Original language: English
Article number: 130804
Journal: Neurocomputing
Volume: 650
Early online date: 2025
DOIs
Publication status: Published - 14 Oct 2025

Keywords

  • Binaural integration
  • Sound localization
  • Transformer
