Audio-Visual Emotion Classification Using Reinforcement Learning-Enhanced Particle Swarm Optimisation

Karolis Kondrotas, Li Zhang, Chee Peng Lim, Houshyar Asadi, Yonghong Yu

Research output: Contribution to journal › Article › peer-review

Abstract

Extracting fine-grained spatial-temporal characteristics for emotion classification is challenging owing to the subtlety and ambiguity of emotional expressions in video and audio channels. In this research, we propose an audio-visual ensemble model comprising a two-stream 3D Convolutional Neural Network (CNN) architecture, with RGB and optical flow as inputs, for video emotion classification, and a variant of Wav2Vec2 for audio emotion recognition. The Wav2Vec2 variant integrates additional recurrent and attention layers with each transformer block to extract long- and short-term dependencies. A new Particle Swarm Optimisation (PSO) algorithm is proposed to fine-tune the hyper-parameters of the 3D CNNs and the enhanced Wav2Vec2, and to formulate audio-visual ensemble models with the smallest sizes. It integrates a reinforcement learning (RL) algorithm, Asynchronous Advantage Actor-Critic (A3C), for search parameter and hybrid leader construction; another RL algorithm, Proximal Policy Optimisation (PPO), for search action selection; and hypotrochoid- and superformula-based search operations. Evaluated on audio-visual emotion datasets, our evolving ensemble model significantly outperforms those devised by other search methods and existing state-of-the-art deep networks.
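
The abstract names hypotrochoid and superformula curves as the basis of two search operations but does not say how they enter the particle update. As a minimal sketch of the curves only, the Python below evaluates the standard hypotrochoid parametrisation and the Gielis superformula. The function names and default parameter values are illustrative assumptions, not taken from the paper, and the code makes no claim about how the proposed PSO variant applies these curves.

```python
import math


def hypotrochoid(theta, R=5.0, r=3.0, d=5.0):
    """Point (x, y) on a hypotrochoid at angle theta.

    R: radius of the fixed circle, r: radius of the rolling circle,
    d: distance of the tracing point from the rolling circle's centre.
    Defaults are illustrative assumptions, not values from the paper.
    """
    k = (R - r) / r
    x = (R - r) * math.cos(theta) + d * math.cos(k * theta)
    y = (R - r) * math.sin(theta) - d * math.sin(k * theta)
    return x, y


def superformula(phi, m=4.0, n1=2.0, n2=2.0, n3=2.0, a=1.0, b=1.0):
    """Radius r(phi) of the Gielis superformula in polar coordinates.

    With the (assumed) defaults m=4, n1=n2=n3=2, a=b=1 the curve
    reduces to the unit circle, which makes a convenient sanity check.
    """
    t1 = abs(math.cos(m * phi / 4.0) / a) ** n2
    t2 = abs(math.sin(m * phi / 4.0) / b) ** n3
    return (t1 + t2) ** (-1.0 / n1)


if __name__ == "__main__":
    # Sample a few points along each curve.
    for i in range(4):
        angle = i * math.pi / 6.0
        print(angle, hypotrochoid(angle), superformula(angle, m=6.0, n1=1.0))
```

Varying `m`, `n1`, `n2` and `n3` in the superformula, or the ratio `R / r` in the hypotrochoid, yields a wide family of closed and spiky trajectories. That variety is presumably what makes such curves attractive as search-path generators in a swarm algorithm.
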
Original language: English
Pages (from-to): 18
Journal: IEEE Transactions on Affective Computing
Early online date: 21 Jul 2025
Publication status: E-pub ahead of print - 21 Jul 2025
