Cluster search optimisation of deep neural networks for audio emotion classification

Sam Slade, Li Zhang, Houshyar Asadi, Chee Peng Lim, Yonghong Yu, Dezong Zhao, Arjun Panesar, Philip Fei Wu, Rong Gao

Research output: Contribution to journal › Article › peer-review


Abstract

Automated patient monitoring solutions greatly benefit from audio emotion classification, although the considerable variance in individual expression and interpretation of emotions poses a challenge. Current approaches often employ standard Audio Spectrogram Transformer (AST) and deep learning models such as Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN)-based networks. However, their performance can be enhanced by integrating neural architecture search techniques using swarm optimisation algorithms. In this research, we explore AST with hyperparameter optimisation for speech emotion recognition. Three deep learning architectures with optimisable n-block structures and variable filter numbers, i.e. 1DCNN, bidirectional LSTM (BiLSTM) and CNN-BiLSTM, are also proposed, enabling the optimisation of network depth and width. A novel Cluster Search Optimisation (CSO) algorithm is introduced. It incorporates Cluster Centroid Search, a Cluster Distance Improvement metric and reinforcement learning to dispatch different search actions based on clustering convergence and Q-learning strategies, respectively. A novel Noise Tempered K-means (NTKM) clustering model is also proposed with the integration of Gaussian-based noise insertion and cluster compactness-separation measurement, to further fine-tune the cluster centroids obtained using OPTICS clustering. CSO is used for hyperparameter and architecture search for AST and the aforementioned deep networks. Attention mechanisms are also integrated with CSO-optimised networks to further enhance feature learning. We evaluate the resulting models against those devised by other optimisation algorithms across the EMO-DB, SAVEE, and TESS datasets. The empirical results demonstrate that CSO-optimised AST and CNN-BiLSTM with attention mechanisms outperform other architectures and yield favourable comparison results against those from existing state-of-the-art audio emotion classification methods.
Original language: English
Article number: 113223
Number of pages: 25
Journal: Knowledge-Based Systems
Volume: 314
Early online date: 1 Mar 2025
Publication status: E-pub ahead of print - 1 Mar 2025
