#GFNET: A LIGHTWEIGHT GROUP FRAME NETWORK FOR EFFICIENT HUMAN ACTION RECOGNITION

To address the heavy computational cost and large number of parameters of existing action recognition methods, the authors propose GFNet.

To handle these issues, we propose a lightweight neural network called Group Frame Network (GFNet).

Through frame-level decomposition, GFNet extracts the features of each frame at a minimal cost.

GFNet adds frame-level decomposition to extract features of each frame at a minuscule cost.

Two core components are designed that extract spatio-temporal information from RGB images alone, without resorting to optical flow, multi-scale testing, and the like.

There are two core components: Group Temporal Module (GTM) and Group Spatial Module (GSM). These two modules enable GFNet to obtain temporal-spatial information only from RGB images.

To demonstrate the model's effectiveness, they do not use any pre-training strategy.

To verify the validity of the model, no pre-training strategy is used in our experiments.

The input to the network is a fixed number of video frames, obtained by segment-wise sampling of the video.

The entire video with a variable number of frames is provided as the input of the network. Through an average sampling strategy, the video is divided into N equal-length segments and only one frame is selected from each segment. Due to the repeatability of adjacent frames, this sampling strategy can reduce inter-frame redundancy while preserving long-temporal information.
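The sampling strategy above can be sketched as follows (a minimal illustration; the function name and the choice of the segment midpoint are my assumptions, not taken from the paper):

```python
# Segment-based average sampling: split the video into n_segments
# equal-length segments and pick one frame index from each segment
# (here: the middle of the segment, an assumed choice).

def sample_frames(num_frames, n_segments):
    seg_len = num_frames / n_segments
    return [int(seg_len * i + seg_len / 2) for i in range(n_segments)]

print(sample_frames(100, 8))
```

Because adjacent frames are highly similar, picking one frame per segment drops little information while covering the whole temporal extent of the video.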

The K sampled frames are fed into the network simultaneously to capture temporal information.

The first part is a feature extraction layer consisting of K separated branches. The sampled frames are simultaneously fed into the network to maintain the temporal information among these frames.

Each branch computes independently to obtain spatial information.

In the feature extraction layer, each frame is learned independently using a network branch to get its spatial features.

I don't quite understand this part.

All the sampled frames are stacked along the channel dimension.
It means that the input channel of GFNet is 3N when using RGB images as input.
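A small numpy sketch of this channel-wise stacking (the shapes are hypothetical examples, not from the paper):

```python
import numpy as np

# N sampled RGB frames, each of shape (3, H, W), are concatenated
# along the channel axis, giving a single (3N, H, W) input tensor.
N, H, W = 8, 224, 224
frames = [np.zeros((3, H, W)) for _ in range(N)]
stacked = np.concatenate(frames, axis=0)
print(stacked.shape)  # (24, 224, 224)
```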

Because residual networks offer strong generalization ability and performance, the subsequent blocks use the highly modularized residual unit.

Owing to the impressive performance and strong generalization ability of the residual architecture, the block is based on the highly modularized residual unit.

I don't quite understand this sentence.

Considering the extraneous motion and identical texture features in sampled frames, GFNet decomposes frames and reduces the number of channels for each frame to lessen spatial redundancy. To be specific, the number of channels is equally divided among branches. It means that only a small number of channels are used per frame.
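A back-of-the-envelope comparison of weight counts (all sizes hypothetical) shows why dividing the channels equally among N independent branches is cheap, assuming each branch convolves only its own 1/N share of channels:

```python
# Parameter count of one 3x3 convolution: full (all channels jointly)
# versus split into N independent branches of C/N channels each.
c_in = c_out = 256
k = 3
n_branches = 8

full = c_in * c_out * k * k                                            # joint conv
grouped = n_branches * (c_in // n_branches) * (c_out // n_branches) * k * k
print(full, grouped, full // grouped)  # grouped uses 1/N of the weights
```

This is the same arithmetic that makes grouped convolutions (as in ResNeXt) cheaper than dense ones.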

#Group Temporal Module

Because each branch is computed separately, accuracy would inevitably drop (no inter-frame relationships are extracted), so the GTM module is proposed.

To leverage the inter-frame information effectively and better strengthen temporal relationships, GTM is proposed to efficiently overcome the side effects brought by the separated branches.

GTM consists of a translation layer and a 3D convolution layer.

The translation layer rearranges the data dimensions. It includes channel merging and channel separation, which convert the feature map between its four-dimensional and five-dimensional forms.
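One plausible reading of this 4D-to-5D conversion, sketched with numpy reshapes (the exact shapes and dimension order are my assumptions; the paper does not give code):

```python
import numpy as np

# The 4D feature map (batch, N*C, H, W) is separated into a 5D tensor
# (batch, C, N, H, W) so a 3D convolution can mix information across
# the N frames; channel merging is the inverse transformation.
B, N, C, H, W = 2, 8, 16, 14, 14
x4d = np.zeros((B, N * C, H, W))

x5d = x4d.reshape(B, N, C, H, W).transpose(0, 2, 1, 3, 4)   # separation
back = x5d.transpose(0, 2, 1, 3, 4).reshape(B, N * C, H, W)  # merging
print(x5d.shape, back.shape)
```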

#Group Spatial Module

For the convolution layers of ResNet50, the computational cost is closely related to the number of channels. Motivated by this, a novel module called GSM is designed to significantly decrease the number of parameters and the computational cost.

GSM likewise minimizes computational cost by extracting only the texture information.

Because of the similarity among frames, the texture information is repetitive. Meanwhile, irrelevant motion inside frames increases the intra-frame redundancy. Aiming at minimizing the interference of redundant information, GSM diminishes the number of channels to extract features per frame.
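A rough cost estimate (hypothetical layer sizes, not from the paper) illustrates why reducing per-frame channels helps: the multiply-add count of a convolution scales with the product of input and output channels, so cutting both by 4x cuts the cost by 16x:

```python
# Approximate multiply-adds of one KxK convolution layer.
def conv_flops(c_in, c_out, k, h, w):
    return c_in * c_out * k * k * h * w

full = conv_flops(256, 256, 3, 56, 56)  # all channels per frame
slim = conv_flops(64, 64, 3, 56, 56)    # per-frame channels reduced 4x
print(full // slim)  # 16
```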