
To handle these issues, we propose a lightweight neural network called Group Frame Network (GFNet).

GFNet adds frame-level decomposition to extract features of each frame at a minuscule cost.

There are two core components: the Group Temporal Module (GTM) and the Group Spatial Module (GSM). Together, these two modules enable GFNet to obtain temporal-spatial information from RGB images alone.


To verify the validity of the model, no pre-training strategy is used in our experiments.


The entire video, which may contain a variable number of frames, is provided as the input of the network. Through an average sampling strategy, the video is divided into N equal-length segments and exactly one frame is selected from each segment. Because adjacent frames are highly similar, this sampling strategy reduces inter-frame redundancy while preserving long-range temporal information.
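The sampling step can be illustrated with a minimal sketch. The function name `sample_frame_indices` is hypothetical, and the choice of the centre frame of each segment is an assumption: the text only states that one frame is selected per segment.

```python
def sample_frame_indices(num_frames: int, num_segments: int) -> list[int]:
    """Return one frame index from each of `num_segments` equal-length segments.

    Assumption: the centre frame of each segment is picked. Integer
    arithmetic is used so that videos whose length is not a multiple of
    `num_segments` are still covered evenly.
    """
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    if num_frames < num_segments:
        raise ValueError("the video has fewer frames than segments")
    # Centre of segment i is (i + 0.5) * num_frames / num_segments.
    return [
        (2 * i + 1) * num_frames // (2 * num_segments)
        for i in range(num_segments)
    ]


indices = sample_frame_indices(num_frames=80, num_segments=8)
print(indices)  # [5, 15, 25, 35, 45, 55, 65, 75]
```

Whatever the video length, the network then receives exactly N frames, one from each part of the video.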

The first part is a feature extraction layer consisting of K separate branches. The sampled frames are fed into the network simultaneously to preserve the temporal information among them.


In the feature extraction layer, each frame is processed independently by its own network branch to extract its spatial features.


All the sampled frames are stacked by channel-wise convolution.
This means that GFNet has 3N input channels when RGB images are used as input.
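To make the 3N channel count concrete, the sketch below stacks N RGB frames along the channel axis using plain nested lists. It models only the stacking itself, not any convolution; the helper name `stack_frames_channelwise` and the frame sizes are illustrative assumptions.

```python
def stack_frames_channelwise(frames):
    """Concatenate frames shaped [3][H][W] into one input shaped [3N][H][W]."""
    stacked = []
    for frame in frames:
        # Append the three colour channels of this frame after the
        # channels of all previous frames.
        stacked.extend(frame)
    return stacked


# Hypothetical sizes: N = 8 sampled frames of 4 x 4 pixels each.
N, H, W = 8, 4, 4
frames = [
    [[[0.0] * W for _ in range(H)] for _ in range(3)]
    for _ in range(N)
]
stacked = stack_frames_channelwise(frames)
print(len(stacked))  # 24, i.e. 3N channels
```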


Owing to the impressive performance and strong generalization ability of the residual architecture, the block is based on the highly modularized residual unit.


Considering the extraneous motion and identical texture features in the sampled frames, GFNet decomposes the frames and reduces the number of channels for each frame to lessen spatial redundancy. Specifically, the channels are divided equally among the branches, so only a small number of channels is used per frame.
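The equal division of channels among branches amounts to a simple integer split. The sketch below assumes the total channel count is divisible by the number of branches; the function name and the numbers used are hypothetical, not values from the paper.

```python
def channels_per_branch(total_channels: int, num_branches: int) -> int:
    """Split `total_channels` equally among `num_branches` branches."""
    if num_branches <= 0:
        raise ValueError("num_branches must be positive")
    if total_channels % num_branches != 0:
        raise ValueError("channels cannot be divided equally among branches")
    return total_channels // num_branches


# Hypothetical example: a 64-channel layer shared by 8 frame branches.
per_branch = channels_per_branch(total_channels=64, num_branches=8)
print(per_branch)  # 8
```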

#Group Temporal Module


To exploit inter-frame information effectively and strengthen temporal relationships, GTM is proposed to overcome the side effects introduced by the separated branches.

GTM consists of a translation layer and a 3D convolution layer.

The translation layer rearranges the data dimensions. It comprises channel merging and channel separation, which together convert the feature map from four-dimensional data to five-dimensional data.
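As a rough illustration of the channel-separation step, the sketch below regroups the merged channels of a single sample into an explicit frame axis. Combined with the batch axis, this turns a four-dimensional (batch, K·C, H, W) tensor into a five-dimensional (batch, C, K, H, W) one. The axis order and the assumption that each frame's channels are stored contiguously are ours; the text does not specify the layout.

```python
def separate_channels(feature_map, num_frames):
    """Regroup one sample from [K*C][H][W] into [C][K][H][W].

    Assumption: the C channels of frame k occupy indices k*C .. (k+1)*C - 1.
    """
    total = len(feature_map)
    if num_frames <= 0 or total % num_frames != 0:
        raise ValueError("channel count must be a multiple of num_frames")
    c = total // num_frames
    return [
        [feature_map[k * c + ch] for k in range(num_frames)]
        for ch in range(c)
    ]


# Hypothetical sample with K = 2 frames and C = 3 channels per frame;
# every 1 x 1 channel stores its own index so the regrouping is visible.
merged = [[[float(i)]] for i in range(6)]
separated = separate_channels(merged, num_frames=2)
print(len(separated), len(separated[0]))  # 3 2
```

With the frame axis made explicit, a 3D convolution can slide across frames as well as across spatial positions.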

#Group Spatial Module

For the convolution layers of ResNet50, the computational cost is closely tied to the number of channels. Motivated by this, we design a novel module, GSM, to significantly decrease the number of parameters and the computational cost.

Because of the similarity among frames, the texture information is repetitive. Meanwhile, irrelevant motion inside frames increases the intra-frame redundancy. To minimize the interference of redundant information, GSM reduces the number of channels used to extract features from each frame.
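The link between channel count and cost can be checked with the standard parameter count of a 2D convolution, k·k·C_in·C_out (bias ignored). The sketch below compares one full-width convolution with K narrower per-frame convolutions. Treating each branch as an independent convolution over C/K channels is our assumption about how the reduction is realised, and the layer sizes are illustrative only.

```python
def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a standard 2D convolution without bias."""
    return kernel * kernel * c_in * c_out


# Hypothetical layer: 256 -> 256 channels, 3 x 3 kernel, K = 8 branches.
C, K, KERNEL = 256, 8, 3

full = conv_params(C, C, KERNEL)
# Each branch only sees C / K input channels and produces C / K outputs.
per_branch = conv_params(C // K, C // K, KERNEL)
split_total = K * per_branch

print(full)         # 589824
print(split_total)  # 73728, a factor of K fewer weights
```

Because the cost grows with the product of input and output channels, dividing both by K and repeating the layer K times still leaves a K-fold saving overall.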