2020-09-10 Weekly Report #04 刘潘

#I. Tasks achieved last week


  • 《Gate-Shift Networks for Video Action Recognition》
  • 《TSM: Temporal Shift Module for Efficient Video Understanding》
  • 《PAN: Towards Fast Action Recognition via Learning Persistence of Appearance》, published this August and currently the top-1 result on the Something-Something-V1 dataset.
  • Ran test experiments with TSM and PAN.

#II. Reports


#Gate-Shift Networks for Video Action Recognition


In practice, C3D can perform poorly because of its large number of parameters and heavy computation when a sufficiently large dataset for large-scale training is not available.

GSM

The paper proposes the Gate-Shift Module (GSM), which converts a 2D CNN into an efficient spatio-temporal feature extractor.

With GSM plugged in, a 2D CNN can adaptively learn to route features through time and combine them, with almost no additional parameters or computational overhead.
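
To make the gating-plus-shift idea concrete, here is a minimal PyTorch sketch of the mechanism as I understand it. This is not the authors' exact GSM: the split into forward/backward channel halves and the 2D gating conv are simplifying assumptions; the key point is that a learned spatial gate decides, per location, how much of the feature is shifted in time and how much stays in place.

```python
import torch
import torch.nn as nn


class GateShiftSketch(nn.Module):
    """Minimal sketch of the gate-then-shift idea (not the authors' exact GSM).

    Input: (N*T, C, H, W) features from a 2D CNN, T frames per clip. A light
    conv produces spatial gates in [-1, 1]; the gated part of the features is
    shifted forward/backward in time, the ungated remainder stays in place.
    """

    def __init__(self, channels, n_segments=8):
        super().__init__()
        self.n_segments = n_segments
        # one gating plane for each half of the channels (a simplifying assumption)
        self.gate_conv = nn.Conv2d(channels, 2, kernel_size=3, padding=1)

    def forward(self, x):
        nt, c, h, w = x.shape
        t = self.n_segments
        gate = torch.tanh(self.gate_conv(x))                  # (N*T, 2, H, W)
        x = x.view(nt // t, t, c, h, w)
        gate = gate.view(nt // t, t, 2, h, w)

        half = c // 2
        fwd = x[:, :, :half] * gate[:, :, 0:1]                # part routed forward in time
        bwd = x[:, :, half:] * gate[:, :, 1:2]                # part routed backward in time
        res_fwd = x[:, :, :half] - fwd                        # ungated part, kept in place
        res_bwd = x[:, :, half:] - bwd

        shift_fwd = torch.zeros_like(fwd)
        shift_fwd[:, 1:] = fwd[:, :-1]                        # shift by +1 frame
        shift_bwd = torch.zeros_like(bwd)
        shift_bwd[:, :-1] = bwd[:, 1:]                        # shift by -1 frame

        out = torch.cat([res_fwd + shift_fwd, res_bwd + shift_bwd], dim=2)
        return out.reshape(nt, c, h, w)
```

In the paper the module is inserted into the blocks of a 2D backbone, which is where the "almost no extra cost" claim comes from: only the small gating conv adds parameters.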

Comparison of approaches

Evolution of earlier methods: C3D -> factorized 2D spatial + 1D temporal convolution -> CSN -> GST (models spatial and spatio-temporal interactions in parallel, with 2D and 3D convolutions on separate channel groups) -> TSM (the temporal convolution is restricted to a hard-coded temporal shift that moves some channels forward or backward in time).
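
As a quick illustration of the "2D spatial + 1D temporal" factorization step in this chain, the snippet below compares the parameter count of a full 3D convolution with its factorized form (the channel count 64 is just an example):

```python
import torch.nn as nn

c = 64  # example channel count
conv3d = nn.Conv3d(c, c, kernel_size=(3, 3, 3), padding=1)        # joint spatio-temporal kernel
factorized = nn.Sequential(
    nn.Conv3d(c, c, kernel_size=(1, 3, 3), padding=(0, 1, 1)),    # 2D spatial
    nn.Conv3d(c, c, kernel_size=(3, 1, 1), padding=(1, 0, 0)),    # 1D temporal
)


def n_params(m):
    return sum(p.numel() for p in m.parameters())


print(n_params(conv3d), n_params(factorized))  # ~111k vs. ~49k parameters
```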

All of these existing methods learn structured kernels with hard-wired connectivity and fixed propagation patterns across the network.
At no point in the network is there a data-dependent decision to selectively route features through different branches; the grouping and shuffling patterns are fixed at design time, and learning how to shuffle has combinatorial complexity.

Experiments

From the experiments we conclude that adding GSM to the branch with the least number of convolution layers performs the best.


#TSM: Temporal Shift Module for Efficient Video Understanding

Core idea

Reach the accuracy of 3D models with a 2D model, greatly reducing computation.

Not every shift configuration works well: although the shift itself needs no extra computation, it still moves data, and too much data movement adds latency.

Shifting adds temporal feature extraction, but shifting too many channels also hurts spatial feature extraction.

Authors' approach

Improved shift strategy: rather than shifting all channels, only a fraction of them is shifted, which effectively reduces the latency caused by data movement.

TSM is not inserted directly into the main forward path but applied as a side (residual) branch, so temporal information is gained without damaging the spatial information of the 2D convolutions.
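
A minimal re-implementation of the shift operation described above, following the paper's description (shift_div=8, i.e. 1/8 of the channels shifted in each direction, matches the setting in the test log later in this report). In the residual "blockres" placement, this function is applied to the input of a ResNet block's residual branch, so the identity path keeps the spatial information intact:

```python
import torch


def temporal_shift(x, n_segments, shift_div=8):
    """Shift 1/shift_div of the channels forward in time, another 1/shift_div
    backward, and leave the rest in place. x: (N*T, C, H, W) features."""
    nt, c, h, w = x.shape
    x = x.view(nt // n_segments, n_segments, c, h, w)
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # forward shift
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # backward shift
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # unshifted channels
    return out.view(nt, c, h, w)
```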

The authors also propose a corresponding strategy for real-time online recognition, different from the offline scheme of shifting one slice of channels forward and another backward:

Online model

There are ideas worth borrowing here, and this paper is also one of the baselines of the previous one (GSM).
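
For reference, a rough sketch of the uni-directional online shift: since future frames are unavailable in streaming inference, a slice of the previous frame's features is cached and substituted into the current frame (the class name and cache layout here are assumptions for illustration):

```python
import torch


class OnlineShift:
    """Uni-directional shift for streaming inference: a slice of the previous
    frame's features is cached and substituted into the current frame, so no
    future frame is ever needed."""

    def __init__(self, shift_div=8):
        self.shift_div = shift_div
        self.cache = None

    def __call__(self, x):
        # x: (1, C, H, W) features of the current frame
        fold = x.shape[1] // self.shift_div
        out = x.clone()
        if self.cache is not None:
            out[:, :fold] = self.cache           # reuse features cached from the previous frame
        self.cache = x[:, :fold].detach()        # cache current features for the next frame
        return out
```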

#Experimental results

Procedure

Results

#Related links

#PAN: Towards Fast Action Recognition via Learning Persistence of Appearance

Our PA is over 1000× faster (8196fps vs. 8fps) than conventional optical flow in terms of motion modeling speed.

According to the aforementioned analysis, we can conclude that small displacements of motion boundaries play a vital role in action recognition.

The differences among low-level feature maps will pay more attention to the variations at boundaries. In summary, differences in low-level feature maps can reflect small displacements of motion boundaries due to convolutional operations.

Experiments on UCF101 show that computing the differences at the first layer works best.

We define the basic conv-layer as eight 7×7 convolutions with stride=1 and padding=3, so that the spatial resolutions of the obtained feature maps are not reduced.
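
Based on that description, a hedged sketch of the PA computation: the shared low-level conv is applied to each of the consecutive frames, adjacent feature maps are differenced, and the differences are collapsed into a single motion map. The squared-difference aggregation and the single-map output are my reading of the paper's wording, not a copy of the released code.

```python
import torch
import torch.nn as nn


class PASketch(nn.Module):
    """Sketch of Persistence of Appearance: one shared low-level conv layer
    (eight 7x7 filters, stride 1, padding 3, as quoted above) is applied to m
    consecutive RGB frames; adjacent feature maps are differenced and the
    squared differences are accumulated into a single motion map."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=7, stride=1, padding=3)

    def forward(self, frames):
        # frames: (N, m, 3, H, W) consecutive RGB frames
        n, m, c, h, w = frames.shape
        feat = self.conv(frames.reshape(n * m, c, h, w)).reshape(n, m, 8, h, w)
        diff = feat[:, 1:] - feat[:, :-1]                 # (N, m-1, 8, H, W)
        pa = diff.pow(2).sum(dim=(1, 2)).sqrt()           # accumulate over frame pairs and channels
        return pa.unsqueeze(1)                            # (N, 1, H, W) motion map
```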

Two encoding strategies:

PA as motion modality

PA as attention

The first is better in both computation and accuracy.

A possible reason is that the second fusion strategy leads to imbalanced appearance responses:

However, for e2, attending appearance feature maps with PA will highlight the motion boundaries, leading to the imbalanced appearance responses both inside and at the boundaries of the moving objects.
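
A loose sketch of what the two strategies amount to (the paper's actual fusion details differ; the "1 + pa" weighting and the bilinear resize are assumptions for illustration):

```python
import torch
import torch.nn.functional as F


def pa_as_modality(backbone, pa_maps):
    # e1: stack PA maps from several frame pairs and feed them to the network
    # as an input stream of their own, taking the place of optical flow.
    x = torch.cat(pa_maps, dim=1)                          # (N, n_pairs, H, W)
    return backbone(x)


def pa_as_attention(appearance_feat, pa):
    # e2: resize the PA map to the feature resolution and use it to reweight
    # the appearance features; this highlights motion boundaries but can
    # unbalance responses inside vs. at the boundaries of moving objects.
    pa = F.interpolate(pa, size=appearance_feat.shape[-2:], mode="bilinear",
                       align_corners=False)
    return appearance_feat * (1 + pa)
```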

Various-timescale Aggregation Pooling
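
A rough sketch of the multi-timescale idea behind VAP, matching the "3-level VAP" line in the test log below. The real module's pooling and weighting scheme is more involved, so treat this only as a reading note:

```python
import torch
import torch.nn as nn


class VAPSketch(nn.Module):
    """Rough sketch of 3-level various-timescale aggregation pooling: frame
    features are max-pooled over windows of length 1, 2 and 4, each level is
    averaged into a clip descriptor, and the levels are mixed with learned
    softmax weights."""

    def __init__(self, n_levels=3):
        super().__init__()
        self.n_levels = n_levels
        self.level_weights = nn.Parameter(torch.ones(n_levels) / n_levels)

    def forward(self, x):
        # x: (N, T, D) frame-level features, T divisible by 2 ** (n_levels - 1)
        pooled = []
        for lvl in range(self.n_levels):
            k = 2 ** lvl                                    # window length at this timescale
            win = x.unfold(1, k, k).max(dim=-1).values      # (N, T // k, D)
            pooled.append(win.mean(dim=1))                  # (N, D)
        w = torch.softmax(self.level_weights, dim=0)
        return sum(w[i] * pooled[i] for i in range(self.n_levels))
```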

#Experimental results

#Lite (PAN_Lite test on Something-Something-V2)
```
somethingv2: 174 classes
=> shift: True, shift_div: 8, shift_place: blockres
=> base model: resnet50
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /home/xiangyi/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth
100%|██████████████████████████████████████| 97.8M/97.8M [00:06<00:00, 14.8MB/s]
=> Adding temporal shift...
=> Using 3-level VAP
=> Converting the ImageNet model to a PAN_Lite init model
=> Done. PAN_lite model ready...
video number:24777
video 0 done, total 0/24777, average 0.879 sec/video, moving Prec@1 65.625 Prec@5 87.500
video 1280 done, total 1280/24777, average 0.239 sec/video, moving Prec@1 60.491 Prec@5 85.640
video 2560 done, total 2560/24777, average 0.230 sec/video, moving Prec@1 60.518 Prec@5 85.671
video 3840 done, total 3840/24777, average 0.227 sec/video, moving Prec@1 60.015 Prec@5 85.374
video 5120 done, total 5120/24777, average 0.225 sec/video, moving Prec@1 60.031 Prec@5 85.475
video 6400 done, total 6400/24777, average 0.226 sec/video, moving Prec@1 59.855 Prec@5 85.334
video 7680 done, total 7680/24777, average 0.224 sec/video, moving Prec@1 59.775 Prec@5 85.292
video 8960 done, total 8960/24777, average 0.223 sec/video, moving Prec@1 59.519 Prec@5 85.284
video 10240 done, total 10240/24777, average 0.224 sec/video, moving Prec@1 59.530 Prec@5 85.423
video 11520 done, total 11520/24777, average 0.224 sec/video, moving Prec@1 59.686 Prec@5 85.497
video 12800 done, total 12800/24777, average 0.224 sec/video, moving Prec@1 59.678 Prec@5 85.487
video 14080 done, total 14080/24777, average 0.225 sec/video, moving Prec@1 59.637 Prec@5 85.464
video 15360 done, total 15360/24777, average 0.225 sec/video, moving Prec@1 59.349 Prec@5 85.315
video 16640 done, total 16640/24777, average 0.225 sec/video, moving Prec@1 59.327 Prec@5 85.327
video 17920 done, total 17920/24777, average 0.226 sec/video, moving Prec@1 59.058 Prec@5 85.204
video 19200 done, total 19200/24777, average 0.226 sec/video, moving Prec@1 59.121 Prec@5 85.206
video 20480 done, total 20480/24777, average 0.226 sec/video, moving Prec@1 59.200 Prec@5 85.295
video 21760 done, total 21760/24777, average 0.227 sec/video, moving Prec@1 59.283 Prec@5 85.337
video 23040 done, total 23040/24777, average 0.227 sec/video, moving Prec@1 59.254 Prec@5 85.422
video 24320 done, total 24320/24777, average 0.227 sec/video, moving Prec@1 59.277 Prec@5 85.421
[0.84482759 0.38815789 0.51633987 0.58252427 0.58974359 0.54385965
0.76738609 0.63636364 0.67716535 0.60264901 0.53932584 0.68613139
0.26923077 0.425 0.7122807 0.51914894 0.42639594 0.38157895
0.46025105 0.57345972 0.51574803 0.62280702 0.55232558 0.56382979
0.56818182 0.52631579 0.6 0.48514851 0.71818182 0.77394636
0.78378378 0.77477477 0.82954545 0.11650485 0.36144578 0.203125
0.84331797 0.82129278 0.22222222 0.79411765 0.71584699 0.73214286
0.59624413 0.62057878 0.72972973 0.51253482 0.5873494 0.40703518
0.42857143 0.80430108 0.7257384 0.06666667 0.40625 0.68571429
0.25 0.42 0.4109589 0.60377358 0.17647059 0.81654676
0.92 0.21568627 0.73417722 0.30841121 0.21621622 0.53301887
0.30188679 0.40298507 0.6754386 0.43 0.64285714 0.47826087
0.54411765 0.61538462 0.66981132 0.36842105 0.5 0.30769231
0.2962963 0.77586207 0.296875 0.168 0.36170213 0.44680851
0.64 0.44444444 0.85882353 0.792 0.25 0.19277108
0.56521739 0.85 0.57894737 0.764 0.76694915 0.13888889
0.14705882 0.2892562 0.51020408 0.67765568 0.46792453 0.62637363
0.29310345 0.8125 0.7480315 0.73333333 0.52054795 0.66502463
0.39189189 0.48 0.6056338 0.05555556 0.49640288 0.27777778
0.69097222 0.54347826 0.25925926 0.77777778 0.40677966 0.64356436
0.80314961 0.80681818 0.34975369 0.69230769 0.36538462 0.63761468
0.55339806 0.42608696 0.1302682 0.70955882 0.32142857 0.35616438
0.44827586 0.24561404 0.79619565 0.62269939 0.23529412 0.45
0.24770642 0.72727273 0.6627907 0.359375 0.59375 0.63311688
0.56050955 0.44680851 0.74166667 0.0859375 0.55230126 0.90754717
0.76902174 0.33152174 0.3877551 0.75229358 0.51181102 0.29268293
0.5 0.58695652 0.55045872 0.34545455 0.41284404 0.46052632
0.43925234 0.45 0.84398977 0.88709677 0.96428571 0.94444444
0.81666667 0.70666667 0.45098039 0.72115385 0.74418605 0.65693431]
upper bound: 0.548518228331384
-----Evaluation is finished------
Class Accuracy 53.84%
Overall Prec@1 59.26% Prec@5 85.45%
/home/xiangyi/miniconda3/envs/pan/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order, subok=True)
```
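
For reference, the metrics printed above can be reproduced with something like the following (not the repository's evaluation code): Prec@k is the fraction of clips whose ground-truth class is among the top-k predictions, and "Class Accuracy" is the mean of the per-class accuracies listed in the long array.

```python
import torch


def topk_precision(logits, labels, ks=(1, 5)):
    """Prec@k: fraction of clips whose true class is among the k highest scores."""
    _, pred = logits.topk(max(ks), dim=1)                  # (N, max_k) predicted class indices
    hits = pred.eq(labels.unsqueeze(1))                    # (N, max_k) boolean hit matrix
    return {k: hits[:, :k].any(dim=1).float().mean().item() * 100 for k in ks}


def mean_class_accuracy(pred, labels, n_classes=174):
    """'Class Accuracy': mean of the per-class accuracies over the 174 classes."""
    per_class = []
    for cls in range(n_classes):
        mask = labels == cls
        if mask.any():
            per_class.append((pred[mask] == cls).float().mean().item())
    return 100 * sum(per_class) / len(per_class)
```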

#Related links

#III. Plan for this week


  • Read through the relevant code to understand their improvements.
  • Continue reading papers.