TY - GEN
T1 - A Convolutional Recurrent Neural Network with Spatial Feature Fusion for Environmental Sound Classification
AU - Mhatre, Meehir
AU - Zhang, Li
AU - Panesar, Arjun
PY - 2024/7/30
Y1 - 2024/7/30
N2 - This research proposes a new Convolutional Recurrent Neural Network (CRNN) model with spatial feature fusion for environmental sound classification. In addition to data preprocessing steps such as spectrogram transformation and data augmentation, customized deep networks, i.e. VGG19, ResNet152, and EfficientNetB0 with additional layers, are also proposed for audio classification. Specifically, the proposed CRNN model embeds ResNet152 and EfficientNetB0 in the encoder, where the spatial features extracted by both networks are concatenated. A Long Short-Term Memory (LSTM) component is used as the decoder in the proposed CRNN for temporal feature extraction. Evaluated on the ESC-50 dataset, the proposed CRNN model with multi-channel spatial feature fusion significantly outperforms the customized VGG19, ResNet152, and EfficientNetB0 networks as well as existing studies. The spatial feature fusion, in conjunction with LSTM-based sequential feature extraction, accounts for the superiority of the proposed CRNN model for environmental sound classification.
AB - This research proposes a new Convolutional Recurrent Neural Network (CRNN) model with spatial feature fusion for environmental sound classification. In addition to data preprocessing steps such as spectrogram transformation and data augmentation, customized deep networks, i.e. VGG19, ResNet152, and EfficientNetB0 with additional layers, are also proposed for audio classification. Specifically, the proposed CRNN model embeds ResNet152 and EfficientNetB0 in the encoder, where the spatial features extracted by both networks are concatenated. A Long Short-Term Memory (LSTM) component is used as the decoder in the proposed CRNN for temporal feature extraction. Evaluated on the ESC-50 dataset, the proposed CRNN model with multi-channel spatial feature fusion significantly outperforms the customized VGG19, ResNet152, and EfficientNetB0 networks as well as existing studies. The spatial feature fusion, in conjunction with LSTM-based sequential feature extraction, accounts for the superiority of the proposed CRNN model for environmental sound classification.
U2 - 10.1142/9789811294631_0035
DO - 10.1142/9789811294631_0035
M3 - Conference contribution
SN - 978-981-12-9462-4
VL - 14
T3 - World Scientific Proceedings Series on Computer Engineering and Information Science
SP - 275
EP - 282
BT - Intelligent Management of Data and Information in Decision Making
ER -