Abstract
Action recognition in videos is a topic of interest in computer vision because of potential applications such as multimedia indexing and surveillance in public areas. In this research, we first propose spatial and temporal Convolutional Neural Networks (CNNs), based on transfer learning using ResNet101, GoogleNet and VGG16, for human action recognition. In addition, hybrid CNN-Recurrent Neural Network (RNN) models are exploited as encoder-decoder architectures for video action classification. In particular, different types of RNNs, namely Long Short-Term Memory (LSTM), Bidirectional-LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Bidirectional-GRU (BiGRU), serve as the decoders for action recognition. To further enhance performance, diverse aggregation networks of CNN and CNN-RNN models are implemented. Specifically, an Average Fusion method integrates spatial and temporal CNNs trained on images with CNN-RNNs trained on videos, where the final classification is formed by combining the Softmax scores of these models via late fusion. A total of 22 models (1 motion CNN, 3 spatial CNNs, 12 CNN-RNNs and 6 fusion networks) are implemented and evaluated on the UCF11, UCF50, and UCF101 datasets for performance comparison. The empirical results indicate the significant efficiency of the Average Fusion of multiple Spatial-CNNs with one Motion-CNN, and of ResNet101-BiGRU, among all the networks for realistic video action recognition.
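The late-fusion step described in the abstract can be illustrated with a minimal sketch. It assumes each model outputs one Softmax score per class and that all models carry equal weight. The function names and the logit values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_fusion(score_lists):
    """Average per-class Softmax scores across models (equal-weight late fusion)."""
    n_models = len(score_lists)
    n_classes = len(score_lists[0])
    if any(len(scores) != n_classes for scores in score_lists):
        raise ValueError("all models must score the same set of classes")
    return [
        sum(scores[c] for scores in score_lists) / n_models
        for c in range(n_classes)
    ]


def predict(score_lists):
    """Return the index of the class with the highest fused score."""
    fused = average_fusion(score_lists)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical logits from three models for a 3-class problem (assumed values).
spatial_logits = [2.0, 1.0, 0.1]   # e.g. a spatial CNN
motion_logits = [0.5, 2.5, 0.3]    # e.g. a motion CNN
cnn_rnn_logits = [0.2, 2.0, 0.1]   # e.g. a CNN-RNN

scores = [softmax(l) for l in (spatial_logits, motion_logits, cnn_rnn_logits)]
fused = average_fusion(scores)
predicted_class = predict(scores)
```

In this example the spatial model alone would pick class 0, while the fused scores favor class 1, which shows how averaging lets the models outvote a single disagreeing stream.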
Original language | English |
---|---|
Title of host publication | IEEE International Conference on Systems, Man, and Cybernetics |
Place of Publication | USA |
Pages | 4852-4858 |
Number of pages | 7 |
Publication status | Published - 29 Jan 2024 |