# Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition
Top-1 accuracy on HMDB51 and UCF101 reaches the current state of the art: 85.10% and 98.69%, respectively.
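Temporal global average pooling prevents a richer representation of temporal information. Although the receptive field may span the whole video clip, the amount of information contributed by different parts of the clip follows a Gaussian distribution, so a plain average can discard information.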
The temporal global average pooling (TGAP) layer used at the end of all 3D CNN architectures hinders the richness of the final temporal information.
Although the receptive field might cover the whole clip, the effective receptive field has a Gaussian distribution.
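
To make the information loss concrete, here is a minimal pure-Python sketch of temporal average pooling. It is only an illustration and not the paper's code: the function name `temporal_global_average_pool` and the toy feature values are assumptions, and real 3D CNN features would be tensors rather than nested lists. It shows one specific way averaging discards information: a clip and its time-reversed copy pool to the same vector.

```python
def temporal_global_average_pool(features):
    """Average a clip's per-timestep feature vectors into one vector.

    `features` is a list of T feature vectors (one per temporal position),
    each a list of C floats. Returns a single list of C floats.
    """
    if not features:
        raise ValueError("features must contain at least one timestep")
    num_steps = len(features)
    num_channels = len(features[0])
    return [
        sum(step[c] for step in features) / num_steps
        for c in range(num_channels)
    ]


# Toy clip: 4 temporal positions, 2 channels (values are made up).
clip = [[1.0, 0.0], [2.0, 0.0], [3.0, 1.0], [4.0, 1.0]]
reversed_clip = list(reversed(clip))

pooled = temporal_global_average_pool(clip)
pooled_reversed = temporal_global_average_pool(reversed_clip)

# Both orderings collapse to the same vector, so temporal order is lost.
print(pooled == pooled_reversed)
```

Because the pooled vector is identical for both orderings, any layer placed after the average cannot tell the two clips apart. This is one reading of why a plain average limits the final temporal representation.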