Dual representations: A novel variant of Self-Supervised Audio Spectrogram Transformer with multi-layer feature fusion and pooling combinations for sound classification

Research output: Contribution to journal › Article › peer-review


Abstract

The Self-Supervised Audio Spectrogram Transformer (SSAST) has recently been verified as a state-of-the-art model for various audio and speech command classification tasks. SSAST uses self-supervised learning to reduce the need for substantial pre-training data, removing a disadvantage of its supervised counterpart, the Audio Spectrogram Transformer (AST). Because transformers such as SSAST use only the feature representations from the last layer for downstream classification tasks, we believe that important information from the middle layers is lost during training. Therefore, in this research, we propose a novel variant of the SSAST model that uses a dual representation generated by fusing the outputs of multiple layers (i.e. both middle and last layers) for audio classification. Specifically, we apply all-patch-wise pooling combinations to all patches from both a middle layer and the last layer of a pre-trained patch-based self-supervised learning model. This generates two individual sequences of output patches, based on a variety of mean, max, and min pooling combinations, which are concatenated to form the final double-sized representation. This dual representation contains more discriminative information, providing the linear multi-layer perceptron head layers with more useful information for audio classification. In comparison with existing state-of-the-art studies, the proposed model using the dual representations yielded by multi-layer feature fusion and pooling combinations significantly boosts performance on all tasks. The resulting accuracy rates are 93.67%, 100%, 79.59%, 79.59%, 91.22%, and 85.90% for CREMA-D, TESS, RAVDESS, Speech Emotion Classification, Isolated Urban Events, and CornellBirdCall, respectively.
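The core idea of the dual representation can be illustrated with a minimal sketch. The function names, array shapes, and the choice of a single pooling mode per layer are illustrative assumptions; the paper's actual model applies these operations to patch embeddings from a pre-trained SSAST and explores combinations of pooling operators.

```python
import numpy as np


def pool_patches(patches, mode):
    """All-patch-wise pooling over the patch axis (hypothetical helper).

    patches: array of shape (num_patches, embed_dim), e.g. one layer's
    output patch embeddings from a pre-trained transformer.
    """
    if mode == "mean":
        return patches.mean(axis=0)
    if mode == "max":
        return patches.max(axis=0)
    if mode == "min":
        return patches.min(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")


def dual_representation(middle_patches, last_patches, mode="mean"):
    """Fuse pooled middle- and last-layer patch outputs into one
    double-sized vector, per the abstract's description."""
    return np.concatenate([
        pool_patches(middle_patches, mode),
        pool_patches(last_patches, mode),
    ])


# Toy example: 4 patches with 8-dim embeddings from each of two layers.
rng = np.random.default_rng(0)
middle = rng.standard_normal((4, 8))
last = rng.standard_normal((4, 8))

rep = dual_representation(middle, last, mode="mean")
print(rep.shape)  # (16,): double the single-layer embedding size
```

The doubled vector would then be fed to the linear MLP head for classification; which middle layer to tap and which pooling combination to use are the design choices the paper evaluates.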
Original language: English
Article number: 129415
Number of pages: 15
Journal: Neurocomputing
Volume: 623
Early online date: 17 Jan 2025
DOIs
Publication status: E-pub ahead of print - 17 Jan 2025
