TY - JOUR
T1 - Dual representations: A novel variant of Self-Supervised Audio Spectrogram Transformer with multi-layer feature fusion and pooling combinations for sound classification
AU - Choi, Hyosun
AU - Zhang, Li
AU - Watkins, Chris
PY - 2025/1/17
Y1 - 2025/1/17
N2 - The Self-Supervised Audio Spectrogram Transformer (SSAST) has recently been verified as a state-of-the-art model in various audio and speech command classification tasks. SSAST uses self-supervised learning to reduce the need for substantial data to pre-train transformers, removing the disadvantage of its supervised counterpart, the Audio Spectrogram Transformer (AST). Because transformers such as SSAST use only the feature representations from the last layer for downstream classification tasks, we believe that this process loses important information from middle layers during training. Therefore, in this research, we propose a novel variant of the SSAST model that uses a dual representation, generated by fusing the outputs of multiple layers (i.e., both a middle layer and the last layer), for audio classification. Specifically, we apply all-patch-wise pooling combinations to all patches from both a middle layer and the last layer of a pre-trained patch-based self-supervised learning model. This generates two individual sequences of output patches based on a variety of mean, max, and min pooling combinations, which are combined into the final double-sized representation. This dual representation contains more discriminative information, providing the linear multi-layer perceptron head layers with more useful information for audio classification. In comparison with existing state-of-the-art studies, the proposed model using the dual representations yielded by multi-layer feature fusion and pooling combinations significantly boosts performance on all tasks. The resulting accuracy rates are 93.67%, 100%, 79.59%, 79.59%, 91.22%, and 85.90% for CREMA-D, TESS, RAVDESS, Speech Emotion Classification, Isolated Urban Events, and CornellBirdCall, respectively.
UR - https://www.sciencedirect.com/science/article/pii/S0925231225000876
U2 - 10.1016/j.neucom.2025.129415
DO - 10.1016/j.neucom.2025.129415
M3 - Article
SN - 0925-2312
VL - 623
JO - Neurocomputing
JF - Neurocomputing
M1 - 129415
ER -
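
The abstract above describes fusing pooled patch outputs from a middle layer and the last SSAST layer into a double-sized "dual representation" that feeds a linear multi-layer perceptron head. The following is a minimal PyTorch sketch of that idea, not the authors' implementation: the choice of layers, the mean/max/min pooling combination, the tensor shapes, and all names (pool_patches, DualRepresentationHead, hidden_dim, num_classes) are assumptions made for illustration only.

    # Illustrative sketch only: dual representation from two transformer layers
    # via mean/max/min pooling over patches, concatenated and fed to an MLP head.
    import torch
    import torch.nn as nn

    def pool_patches(patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, hidden_dim)
        mean_p = patches.mean(dim=1)
        max_p = patches.max(dim=1).values
        min_p = patches.min(dim=1).values
        # One possible pooling combination: concatenate mean, max, and min.
        return torch.cat([mean_p, max_p, min_p], dim=-1)  # (batch, 3 * hidden_dim)

    class DualRepresentationHead(nn.Module):
        def __init__(self, hidden_dim: int, num_classes: int):
            super().__init__()
            # Pooled vectors from the middle and last layers are concatenated,
            # so the head receives a double-sized representation.
            self.mlp = nn.Sequential(
                nn.Linear(2 * 3 * hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),
            )

        def forward(self, middle_patches: torch.Tensor, last_patches: torch.Tensor) -> torch.Tensor:
            dual = torch.cat(
                [pool_patches(middle_patches), pool_patches(last_patches)], dim=-1
            )
            return self.mlp(dual)

    # Example with random stand-ins for SSAST patch outputs.
    batch, num_patches, hidden_dim = 4, 512, 768
    middle = torch.randn(batch, num_patches, hidden_dim)
    last = torch.randn(batch, num_patches, hidden_dim)
    logits = DualRepresentationHead(hidden_dim, num_classes=6)(middle, last)
    print(logits.shape)  # torch.Size([4, 6])

In this sketch the head sees 2 * 3 * hidden_dim features per example; the paper's actual pooling combinations, layer selection, and head sizes may differ.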