Abstract
Human action recognition can be applied in a multitude of diverse domains such as large-scale surveillance, threat detection, personal safety in hazardous environments, human assistance, health monitoring, and intelligent robotics. Owing to high demand in real-world applications, it has drawn significant attention. In this research, we propose hybrid deep neural networks, namely Convolutional Long Short-Term Memory (ConvLSTM) networks and Long-term Recurrent Convolutional Networks (LRCN), for tackling video action classification. In particular, for the LRCN model, different CNN encoder architectures, such as VGG16, ResNet50, DenseNet121 and MobileNet, as well as several Long Short-Term Memory (LSTM) variant decoder architectures, such as LSTM, bidirectional LSTM (BiLSTM) and Gated Recurrent Unit (GRU), are used for spatio-temporal feature extraction to test model performance. We adopt diverse experimental settings, including different numbers of frames per video and different learning configurations, to optimize performance. The empirical results indicate the superiority of MobileNet in combination with a BiLSTM network over other hybrid network settings for action classification on the UCF50 dataset. Owing to the lightweight MobileNet encoder, this LRCN model also achieves a better trade-off between performance and training and inference computational costs, while outperforming existing state-of-the-art methods.
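The best-performing LRCN variant described above pairs a per-frame CNN encoder with a BiLSTM temporal decoder. A minimal PyTorch sketch of this encoder-decoder pattern is shown below; note this is an illustration only, not the paper's implementation. A tiny stand-in CNN replaces the pretrained MobileNet encoder so the example stays self-contained, and the layer sizes, clip length, and the `LRCN` class name are all assumptions.

```python
import torch
import torch.nn as nn

class LRCN(nn.Module):
    """Illustrative LRCN sketch: a per-frame CNN encoder followed by a
    BiLSTM decoder and a classification head. The paper uses a pretrained
    MobileNet encoder; a tiny CNN stands in here for self-containment."""
    def __init__(self, num_classes=50, feat_dim=64, hidden=128):
        super().__init__()
        # Per-frame spatial encoder (stand-in for MobileNet).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Temporal decoder: bidirectional LSTM over the frame features.
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip):  # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1))  # encode each frame
        feats = feats.view(b, t, -1)              # (batch, time, feat_dim)
        out, _ = self.bilstm(feats)               # (batch, time, 2*hidden)
        return self.head(out[:, -1])              # logits from last step

# Two dummy clips of 20 frames each (UCF50 has 50 action classes).
clip = torch.randn(2, 20, 3, 64, 64)
logits = LRCN()(clip)
print(logits.shape)  # torch.Size([2, 50])
```

Flattening the batch and time dimensions before the encoder mirrors the common "TimeDistributed" trick: the same CNN weights are applied to every frame, and only the recurrent decoder mixes information across time.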
| Original language | English |
|---|---|
| Title of host publication | International Joint Conference on Neural Networks (IJCNN) |
| Place of Publication | Italy |
| Publisher | IEEE |
| Number of pages | 8 |
| ISBN (Electronic) | 978-1-7281-8671-9 |
| ISBN (Print) | 978-1-6654-9526-4 |
| DOIs | |
| Publication status | Published - 30 Sept 2022 |