Abstract

Audio-language models (ALMs) generate linguistic descriptions of sound-producing events and scenes. Advances in dataset creation and computational power have led to significant progress in this domain. This paper surveys 69 datasets used to train ALMs, covering research up to September 2024 (https://github.com/GLJS/audio-datasets). The survey provides a comprehensive analysis of dataset origins, audio and linguistic characteristics, and use cases. Key sources include YouTube-based datasets such as AudioSet, with over two million samples, and community platforms such as Freesound, with over one million samples. The survey evaluates acoustic and linguistic variability across datasets through principal component analysis of audio and text embeddings, analyzes data leakage through CLAP embeddings, and examines sound category distributions to identify imbalances. Finally, the survey identifies key challenges in developing large, diverse datasets to enhance ALM performance, including dataset overlap, biases, accessibility barriers, and the predominance of English-language content, while highlighting specific areas requiring attention: multilingual dataset development, specialized domain coverage, and improved dataset accessibility.
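The embedding analyses mentioned in the abstract can be illustrated with a minimal sketch, assuming the open-source laion_clap package and scikit-learn; the default checkpoint, the file lists, the captions, and the 0.95 similarity threshold below are illustrative assumptions, not the paper's exact setup.

# Minimal sketch (not the paper's exact pipeline): embed audio and captions
# with LAION-CLAP, project the embeddings with PCA, and flag potential
# train/test leakage via near-duplicate embeddings.
# Assumed dependencies: pip install laion_clap scikit-learn numpy
import numpy as np
import laion_clap
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # loads a default pretrained checkpoint

# Hypothetical file lists and captions standing in for two datasets.
train_files = ["train_0001.wav", "train_0002.wav"]
test_files = ["test_0001.wav"]
captions = ["a dog barking in the distance", "rain falling on a tin roof"]

# Audio and text embeddings live in one shared CLAP space (512-dim vectors).
train_emb = model.get_audio_embedding_from_filelist(x=train_files, use_tensor=False)
test_emb = model.get_audio_embedding_from_filelist(x=test_files, use_tensor=False)
text_emb = model.get_text_embedding(captions)

# Variability analysis: project embeddings onto the top principal components.
pca = PCA(n_components=2)
audio_2d = pca.fit_transform(np.vstack([train_emb, test_emb]))
print("explained variance:", pca.explained_variance_ratio_)

# Leakage check: a test clip whose nearest training clip is nearly identical
# in embedding space is likely an overlapping sample. 0.95 is an assumed cutoff.
sim = cosine_similarity(test_emb, train_emb)
leaked = np.where(sim.max(axis=1) > 0.95)[0]
print("possible leaked test indices:", leaked)

Plotting the PCA projections per dataset gives a visual comparison of acoustic and linguistic spread, while the pairwise-similarity step generalizes directly to cross-dataset overlap checks.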
Original language: English
Pages (from-to): 20328-20360
Number of pages: 33
Journal: IEEE Access
Volume: 13
DOIs
Publication status: Published - 2025

Keywords

  • Surveys
  • Data models
  • Training
  • Gold
  • Decoding
  • Contrastive learning
  • Transformers
  • Source separation
  • Web sites
  • Video on demand
  • Audio-to-language learning
  • Language-to-audio learning
  • Audio-language datasets
  • Review
