Abstract
Extracting fine-grained spatial-temporal characteristics for emotion classification is challenging owing to the subtlety and ambiguity of emotional expressions in video and audio channels. In this research, we propose an audio-visual ensemble model comprising a two-stream 3D Convolutional Neural Network (CNN) architecture, with RGB and optical flow as inputs, for video emotion classification, and a variant of Wav2Vec2 for audio emotion recognition. The Wav2Vec2 variant integrates additional recurrent and attention layers with each transformer block to extract long- and short-term dependencies. A new Particle Swarm Optimisation (PSO) algorithm is proposed to fine-tune the hyper-parameters of the 3D CNNs and the enhanced Wav2Vec2, and to formulate audio-visual ensemble models with the smallest sizes. It integrates a reinforcement learning (RL) algorithm, Asynchronous Advantage Actor-Critic (A3C), for search parameter and hybrid leader construction; another RL algorithm, Proximal Policy Optimisation (PPO), for search action selection; and hypotrochoid and superformula-based search operations. Evaluated on audio-visual emotion datasets, our evolving ensemble model significantly outperforms those devised by other search methods as well as existing state-of-the-art deep networks.
| Original language | English |
|---|---|
| Pages (from-to) | 18 |
| Journal | IEEE Transactions on Affective Computing |
| Early online date | 21 Jul 2025 |
| Publication status | E-pub ahead of print - 21 Jul 2025 |