TY - JOUR
T1 - Audio-Visual Emotion Classification Using Reinforcement Learning-Enhanced Particle Swarm Optimisation
AU - Kondrotas, Karolis
AU - Zhang, Li
AU - Lim, Chee Peng
AU - Asadi, Houshyar
AU - Yu, Yonghong
PY - 2025/7/21
Y1 - 2025/7/21
N2 - The extraction of fine-grained spatial-temporal characteristics for emotion classification is a challenging task owing to the subtlety and ambiguity of emotional expressions through video and audio channels. In this research, we propose an audio-visual ensemble model comprising a two-stream 3D Convolutional Neural Network (CNN) architecture, with RGB and optical flow as inputs, for video emotion classification, and a variant of Wav2Vec2 for audio emotion recognition. The Wav2Vec2 variant integrates additional recurrent and attention layers with each transformer block to extract long- and short-term dependencies. A new Particle Swarm Optimisation (PSO) algorithm is proposed to fine-tune the hyper-parameters of the 3D CNNs and the enhanced Wav2Vec2, and to formulate audio-visual ensemble models with the smallest sizes. It integrates a reinforcement learning (RL) algorithm, namely Asynchronous Advantage Actor-Critic (A3C), for search parameter and hybrid leader construction, and another RL algorithm, Proximal Policy Optimisation (PPO), for search action selection, as well as hypotrochoid and super formula-based search operations. Evaluated using audio-visual emotion datasets, our evolving ensemble model significantly outperforms those devised by other search methods as well as existing state-of-the-art deep networks.
AB - The extraction of fine-grained spatial-temporal characteristics for emotion classification is a challenging task owing to the subtlety and ambiguity of emotional expressions through video and audio channels. In this research, we propose an audio-visual ensemble model comprising a two-stream 3D Convolutional Neural Network (CNN) architecture, with RGB and optical flow as inputs, for video emotion classification, and a variant of Wav2Vec2 for audio emotion recognition. The Wav2Vec2 variant integrates additional recurrent and attention layers with each transformer block to extract long- and short-term dependencies. A new Particle Swarm Optimisation (PSO) algorithm is proposed to fine-tune the hyper-parameters of the 3D CNNs and the enhanced Wav2Vec2, and to formulate audio-visual ensemble models with the smallest sizes. It integrates a reinforcement learning (RL) algorithm, namely Asynchronous Advantage Actor-Critic (A3C), for search parameter and hybrid leader construction, and another RL algorithm, Proximal Policy Optimisation (PPO), for search action selection, as well as hypotrochoid and super formula-based search operations. Evaluated using audio-visual emotion datasets, our evolving ensemble model significantly outperforms those devised by other search methods as well as existing state-of-the-art deep networks.
U2 - 10.1109/TAFFC.2025.3591356
DO - 10.1109/TAFFC.2025.3591356
M3 - Article
SN - 1949-3045
SP - 18
JO - IEEE Transactions on Affective Computing
JF - IEEE Transactions on Affective Computing
ER -