Automatic interpretation of human actions from realistic videos attracts increasing research attention owing to its growing demand in real-world deployments such as biometrics, intelligent robotics, and surveillance. In this research, we propose an ensemble model of evolving deep networks comprising Convolutional Neural Networks (CNNs) and bidirectional Long Short-Term Memory (BLSTM) networks for human action recognition. A swarm intelligence (SI)-based algorithm is also proposed for identifying the optimal hyper-parameters of the deep networks. The SI algorithm plays a crucial role for determining the BLSTM network and learning configurations such as the learning and dropout rates and the number of hidden neurons, in order to establish effective deep features that accurately represent the temporal dynamics of human actions. The proposed SI algorithm incorporates hybrid crossover operators implemented by sine, cosine, and tanh functions for multiple elite offspring signal generation, as well as geometric search coefficients extracted from a three-dimensional super-ellipse surface. Moreover, it employs a versatile search process led by the yielded promising offspring solutions to overcome stagnation. Diverse CNN–BLSTM networks with distinctive hyper-parameter settings are devised. An ensemble model is subsequently constructed by aggregating a set of three optimized CNN–BLSTM networks based on the average prediction probabilities. Evaluated using several publicly available human action data sets, our evolving ensemble deep networks illustrate statistically significant superiority over those with default and optimal settings identified by other search methods. The proposed SI algorithm also shows great superiority over several other methods for solving diverse high-dimensional unimodal and multimodal optimization functions with artificial landscapes.