2020-10-15	Weekly Report #08	刘潘

#I. Tasks achieved last week

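Studied the structure of ResNet50 and the code details of how TSM performs the shift.
Read the source code of "TEA: Temporal Excitation and Aggregation for Action Recognition" and studied its improvements, namely the ME module and the MTA module.
Verified the validity of the hypothesis.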

#II. Reports

ResNet50 model structure:

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
	(0): Bottleneck(
	  (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	  (downsample): Sequential(
		(0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
		(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  )
	)
	(1): Bottleneck(
	  (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	)
	(2): Bottleneck(
	  (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	)
  )
  (layer2): Sequential(
	(0): Bottleneck(
	  (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	  (downsample): Sequential(
		(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
		(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  )
	)
	(1): Bottleneck(
	  (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	)
	(2): Bottleneck(
	  (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	)
	(3): Bottleneck(
	  (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	)
  )
  (layer3): Sequential(
	(0): Bottleneck(
	  (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	  (downsample): Sequential(
		(0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
		(1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  )
	)
	(1): Bottleneck(
	  (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	)
	(2): Bottleneck(
	  (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	)
	(3): Bottleneck(
	  (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	)
	(4): Bottleneck(
	  (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	)
	(5): Bottleneck(
	  (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	)
  )
  (layer4): Sequential(
	(0): Bottleneck(
	  (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	  (downsample): Sequential(
		(0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
		(1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  )
	)
	(1): Bottleneck(
	  (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	)
	(2): Bottleneck(
	  (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
	  (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
	  (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
	  (relu): ReLU(inplace)
	)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=2048, out_features=1000, bias=True)
)
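As a quick cross-check of the printout above, the "50" in ResNet50 can be recovered by counting the weighted layers. This is a small sketch added for illustration, and the variable names are my own:

```python
# Blocks per stage, read off the printout above (layer1..layer4).
blocks_per_stage = [3, 4, 6, 3]
convs_per_bottleneck = 3  # conv1 (1x1), conv2 (3x3), conv3 (1x1)

# Stem conv + all bottleneck convs + final fully connected layer.
# The 1x1 downsample convs on the shortcut are conventionally not counted.
depth = 1 + convs_per_bottleneck * sum(blocks_per_stage) + 1

# Output channels of each stage: bottleneck width times the expansion factor 4.
widths = [64, 128, 256, 512]
stage_out_channels = [w * 4 for w in widths]

print(depth)               # 50
print(stage_out_channels)  # [256, 512, 1024, 2048]
```

The stage output widths match the `conv3` output channels in the printout, and the last one (2048) is the `in_features` of `fc`.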

def make_temporal_shift(net, n_segment, n_div=8, place='blockres', temporal_pool=False):
	if temporal_pool:
		n_segment_list = [n_segment, n_segment // 2, n_segment // 2, n_segment // 2]
	else:
		n_segment_list = [n_segment] * 4
	assert n_segment_list[-1] > 0
	#print('=> n_segment per stage: {}'.format(n_segment_list))

	import torchvision
	if isinstance(net, torchvision.models.ResNet):
		if place == 'block':
			def make_block_temporal(stage, this_segment):
				blocks = list(stage.children())
				#print('=> Processing stage with {} blocks'.format(len(blocks)))
				for i, b in enumerate(blocks):
					blocks[i] = TemporalShift(b, n_segment=this_segment, n_div=n_div)
				return nn.Sequential(*(blocks))

			net.layer1 = make_block_temporal(net.layer1, n_segment_list[0])
			net.layer2 = make_block_temporal(net.layer2, n_segment_list[1])
			net.layer3 = make_block_temporal(net.layer3, n_segment_list[2])
			net.layer4 = make_block_temporal(net.layer4, n_segment_list[3])

		elif 'blockres' in place:
			n_round = 1
			if len(list(net.layer3.children())) >= 23: # ResNet101 (23 blocks in layer3): insert the shift only in every second block
				n_round = 2
				#print('=> Using n_round {} to insert temporal shift'.format(n_round))

			def make_block_temporal(stage, this_segment):
				blocks = list(stage.children())
				#print('=> Processing stage with {} blocks residual'.format(len(blocks)))
				for i, b in enumerate(blocks):
					if i % n_round == 0:
						blocks[i].conv1 = TemporalShift(b.conv1, n_segment=this_segment, n_div=n_div)
				return nn.Sequential(*blocks)

			net.layer1 = make_block_temporal(net.layer1, n_segment_list[0])
			net.layer2 = make_block_temporal(net.layer2, n_segment_list[1])
			net.layer3 = make_block_temporal(net.layer3, n_segment_list[2])
			net.layer4 = make_block_temporal(net.layer4, n_segment_list[3])
	else:
		raise NotImplementedError(place)

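In short, with the default place='blockres', the shift is applied right before the first convolution (conv1) of each Bottleneck block (only every second block for ResNet101).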
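To make the `n_round` rule concrete, the sketch below lists which block indices would get a shifted `conv1`. The helper names `pick_n_round` and `shifted_block_indices` are hypothetical and only mirror the conditions in the code above; they are not part of the TSM source:

```python
def pick_n_round(layer3_blocks):
    # Same rule as in make_temporal_shift: a layer3 with >= 23 blocks is treated as ResNet101.
    return 2 if layer3_blocks >= 23 else 1

def shifted_block_indices(num_blocks, n_round):
    # Mirrors the `i % n_round == 0` test inside make_block_temporal.
    return [i for i in range(num_blocks) if i % n_round == 0]

stages = {
    "resnet50": [3, 4, 6, 3],     # block counts from the printout above
    "resnet101": [3, 4, 23, 3],   # standard ResNet101 block counts
}

summary = {}
for name, blocks in stages.items():
    n_round = pick_n_round(blocks[2])
    per_stage = [shifted_block_indices(n, n_round) for n in blocks]
    summary[name] = sum(len(idx) for idx in per_stage)
    print(name, n_round, per_stage[0], summary[name])
# resnet50 1 [0, 1, 2] 16
# resnet101 2 [0, 2] 18
```

So ResNet50 gets a shift in all 16 blocks, while ResNet101 gets one in 18 of its 33 blocks.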

@staticmethod
def shift(x, n_segment, fold_div=3, inplace=False):
	nt, c, h, w = x.size()
	n_batch = nt // n_segment
	x = x.view(n_batch, n_segment, c, h, w)

	fold = c // fold_div # called with fold_div = 8; the paper reports that shifting 1/4 of the channels (1/8 left, 1/8 right) works best
	if inplace:
		# Due to some out of order error when performing parallel computing. 
		# May need to write a CUDA kernel.
		raise NotImplementedError  
		# out = InplaceShift.apply(x, fold)
	else:
		out = torch.zeros_like(x) # all-zero tensor with the same shape as x
		# suppose the frames along the time axis are [1,2,3,4,5,6,7,8]

		out[:, :-1, :fold] = x[:, 1:, :fold]  # shift left
		# becomes [2,3,4,5,6,7,8,0]

		out[:, 1:, fold: 2 * fold] = x[:, :-1, fold: 2 * fold]  # shift right
		# becomes [0,1,2,3,4,5,6,7]

		out[:, :, 2 * fold:] = x[:, :, 2 * fold:]  # not shift
		# the remaining channels are unchanged
	
	return out.view(nt, c, h, w)

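The channels are split into slices: the first 1/8 of the channels are shifted left along the time axis (each frame takes the features of the next frame), the next 1/8 are shifted right, and the remaining channels stay unchanged.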
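The effect can be checked on a tiny tensor. The sketch below is a NumPy re-implementation of the non-inplace branch, written only for illustration; `shift_np` is a hypothetical name, and the input is built so that every channel of frame t holds the value t+1:

```python
import numpy as np

def shift_np(x, n_segment, fold_div=8):
    """NumPy version of the non-inplace branch of shift, for illustration only."""
    nt, c, h, w = x.shape
    n_batch = nt // n_segment
    x = x.reshape(n_batch, n_segment, c, h, w)
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift left
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift right
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # not shifted
    return out.reshape(nt, c, h, w)

T, C = 4, 8
# One clip of T frames with 1x1 spatial size; the time axis reads [1, 2, 3, 4] in every channel.
frames = np.arange(1, T + 1).reshape(T, 1, 1, 1)
x = np.broadcast_to(frames, (T, C, 1, 1)).astype(np.float32)
y = shift_np(x, n_segment=T, fold_div=8)

print(y[:, 0, 0, 0])  # channel 0, shifted left:  [2. 3. 4. 0.]
print(y[:, 1, 0, 0])  # channel 1, shifted right: [0. 1. 2. 3.]
print(y[:, 2, 0, 0])  # channel 2, unchanged:     [1. 2. 3. 4.]
```

With C = 8 and fold_div = 8, each fold is exactly one channel, so channel 0 sees the next frame, channel 1 sees the previous frame, and channels 2 to 7 keep the current frame.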

#TEA: Temporal Excitation and Aggregation for Action Recognition

#Motion Excitation (ME)

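Problem addressed: short-range motion encoding.

Previous work mostly uses pixel-level motion representations such as optical flow; this paper models motion at the feature level instead.

Different channels may be sensitive to different information, so it is important to improve the model's own ability to capture motion-sensitive information.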

The intuition of the proposed ME module is that, among all feature channels, different channels would capture distinct information.
For action recognition, it is beneficial to enable the model to discover and then enhance these motion-sensitive channels.

Given the input:

$$
\mathbf{X} \in \mathbb{R}^{N \times T \times C \times H \times W}
$$

First, the number of channels is reduced to cut the computational cost:

$$
\mathbf{X}^r = \mathrm{conv}_\mathit{red} * \mathbf{X}, \quad \mathbf{X}^r \in \mathbb{R}^{N \times T \times C/r \times H \times W}
$$

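Next, motion features are computed between adjacent time steps. The motion representation is the difference between the feature maps of neighboring frames, where a convolution is first applied to the later frame before the subtraction. The resulting $\mathbf{M}(t)$ is the motion feature at time step $t$, and the motion feature at the last time step is defined as zero, $\mathbf{M}(T) = 0$: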

$$
\mathbf{M}(t) = \mathrm{conv}_\mathit{trans} * \mathbf{X}^r(t+1) - \mathbf{X}^r(t), \quad 1 \leq t \leq T-1, \quad \mathbf{M}(t) \in \mathbb{R}^{N \times C/r \times H \times W}
$$

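Global average pooling then summarizes the spatial information. Because the goal here is to excite the motion-sensitive channels, the detailed spatial layout is not important: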

$$
\mathbf{M}^{s} = \mathrm{Pool}(\mathbf{M}), \quad \mathbf{M}^{s} \in \mathbb{R}^{N \times T \times C/r \times 1 \times 1}
$$

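The channel dimension is then restored to its original size $C$, and a sigmoid function $\delta$ produces the attention weights: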

$$
\mathbf{A} = 2\delta( \mathrm{conv}_\mathit{exp} * \mathbf{M}^s)-1, \quad \mathbf{A} \in \mathbb{R}^{N \times T \times C \times 1 \times 1}
$$
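As a side note of my own (not taken from the paper), a short derivation shows the range of these weights. Writing $z$ for one entry of $\mathrm{conv}_\mathit{exp} * \mathbf{M}^s$:

$$
2\delta(z) - 1 = \frac{2}{1+e^{-z}} - 1 = \frac{1 - e^{-z}}{1 + e^{-z}} = \tanh\left(\frac{z}{2}\right) \in (-1, 1)
$$

So each weight is bounded in magnitude by 1 and is exactly zero when $z = 0$.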

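Multiplying the input by the weights $\mathbf{A}$ emphasizes the motion-related features. However, this also suppresses some background information, so a residual connection is used to compensate: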

such an approach will suppress the static background scene information, which is also beneficial for action recognition.
we propose to adopt a residual connection to enhance motion information meanwhile preserve scene information.

$$
\mathbf{X}^{o} = \mathbf{X} + \mathbf{X} \odot \mathbf{A}, \quad \mathbf{X}^{o} \in \mathbb{R}^{N \times T \times C \times H \times W}
$$
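The ME steps above can be put together in a minimal NumPy sketch. This is my own simplified illustration, not the TEA source code: $\mathrm{conv}_\mathit{red}$ and $\mathrm{conv}_\mathit{exp}$ are written as 1x1 convolutions (channel-mixing matrices), $\mathrm{conv}_\mathit{trans}$ is reduced to a per-channel scale as a stand-in for a spatial convolution, biases and normalization are omitted, and all weights are random:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def motion_excitation(x, w_red, w_trans, w_exp):
    """x: [N, T, C, H, W]; w_red: [C/r, C]; w_trans: [C/r]; w_exp: [C, C/r]."""
    # conv_red as a 1x1 convolution: mix channels, C -> C/r.
    xr = np.einsum('dc,ntchw->ntdhw', w_red, x)
    # conv_trans simplified to a per-channel scale (assumption, see lead-in).
    trans = xr * w_trans.reshape(1, 1, -1, 1, 1)
    # M(t) = conv_trans * X^r(t+1) - X^r(t) for t = 1..T-1, and M(T) = 0.
    m = np.zeros_like(xr)
    m[:, :-1] = trans[:, 1:] - xr[:, :-1]
    # Global average pooling over H and W.
    ms = m.mean(axis=(3, 4), keepdims=True)
    # conv_exp as a 1x1 convolution (C/r -> C), then A = 2 * sigmoid(.) - 1.
    a = 2.0 * sigmoid(np.einsum('cd,ntdhw->ntchw', w_exp, ms)) - 1.0
    # Residual excitation: X^o = X + X * A (A broadcasts over H and W).
    return x + x * a, a

rng = np.random.default_rng(0)
N, T, C, H, W, r = 2, 4, 16, 5, 5, 4
x = rng.standard_normal((N, T, C, H, W))
w_red = rng.standard_normal((C // r, C)) * 0.1
w_trans = np.ones(C // r)
w_exp = rng.standard_normal((C, C // r)) * 0.1

out, a = motion_excitation(x, w_red, w_trans, w_exp)
print(out.shape, a.shape)  # (2, 4, 16, 5, 5) (2, 4, 16, 1, 1)
```

In this bias-free sketch the last time step has $\mathbf{M}(T) = 0$, so its weights are zero and the residual connection passes that frame through unchanged.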

#Multiple Temporal Aggregation(MTA)

The main purpose of this module is to enlarge the receptive field along the temporal dimension. The idea is straightforward:

$$ \begin{array}{lr} \mathbf{X}^o_i = \mathbf{X}_i, & i=1 \\ \mathbf{X}^o_i = \mathrm{conv}_{\mathit{spa}}* ( \mathrm{conv}_{\mathit{temp}} * \mathbf{X}_i), & i=2 \\ \mathbf{X}^o_i = \mathrm{conv}_{\mathit{spa}}* ( \mathrm{conv}_{\mathit{temp}} * ( \mathbf{X}_i + \mathbf{X}^o_{i-1} ) ), & i=3,4 \\ \end{array} $$

That is:

$$ \begin{array}{l} \mathbf{X}^o_1 = \mathbf{X}_1 \\ \mathbf{X}^o_2 = \mathrm{conv}_{\mathit{spa}}* ( \mathrm{conv}_{\mathit{temp}} * \mathbf{X}_2) \\ \mathbf{X}^o_3 = \mathrm{conv}_{\mathit{spa}}* ( \mathrm{conv}_{\mathit{temp}} * ( \mathbf{X}_3 + \mathbf{X}^o_2 ) ) \\ \mathbf{X}^o_4 = \mathrm{conv}_{\mathit{spa}}* ( \mathrm{conv}_{\mathit{temp}} * ( \mathbf{X}_4 + \mathbf{X}^o_3 ) ) \\ \end{array} $$
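The growth of the temporal receptive field can be seen in a small NumPy sketch of the aggregation. It rests on several assumptions of mine: the four $\mathbf{X}_i$ are taken as equal channel groups, $\mathrm{conv}_\mathit{temp}$ is a size-3 temporal kernel of all ones, $\mathrm{conv}_\mathit{spa}$ is replaced by the identity, and the same stand-ins are reused for every group. The function names are hypothetical:

```python
import numpy as np

def temporal_conv(x, kernel):
    """Channel-wise size-3 conv over the T axis of x: [N, T, C, H, W], zero padded."""
    pad = np.pad(x, ((0, 0), (1, 1), (0, 0), (0, 0), (0, 0)))
    return kernel[0] * pad[:, :-2] + kernel[1] * pad[:, 1:-1] + kernel[2] * pad[:, 2:]

def mta(x, conv_temp, conv_spa):
    """Hierarchical aggregation over 4 channel groups, following the equations above."""
    x1, x2, x3, x4 = np.split(x, 4, axis=2)
    o1 = x1
    o2 = conv_spa(conv_temp(x2))
    o3 = conv_spa(conv_temp(x3 + o2))
    o4 = conv_spa(conv_temp(x4 + o3))
    return np.concatenate([o1, o2, o3, o4], axis=2)

# Stand-ins (assumptions): all-ones temporal kernel, identity instead of a spatial conv.
conv_temp = lambda z: temporal_conv(z, np.array([1.0, 1.0, 1.0]))
conv_spa = lambda z: z

# One clip, 9 frames, 4 channels (one per group), with an impulse in the middle frame.
x = np.zeros((1, 9, 4, 1, 1))
x[0, 4] = 1.0
out = mta(x, conv_temp, conv_spa)

# Number of frames that the impulse reaches in each group's output.
support = [int(np.count_nonzero(out[0, :, g, 0, 0])) for g in range(4)]
print(support)  # [1, 3, 5, 7]
```

Each later group stacks one more temporal convolution on top of the previous group's output, so its temporal extent grows by two frames per group.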

This operation requires reshaping $\mathbf{X}$ from $\left[ N,T,C,H,W\right]$ to $\left[ NHW, C, T\right]$, so that a 1D convolution can be applied along the temporal axis.
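A minimal NumPy sketch of this reshape and its inverse is given below. The sizes are arbitrary example values, and the axis order is my own choice for illustration; the actual TEA code may order the merged axes differently:

```python
import numpy as np

N, T, C, H, W = 2, 8, 16, 7, 7
x = np.arange(N * T * C * H * W, dtype=np.float32).reshape(N, T, C, H, W)

# [N, T, C, H, W] -> [N, H, W, C, T], then merge N, H, W into one batch axis.
y = x.transpose(0, 3, 4, 2, 1).reshape(N * H * W, C, T)

# Inverse: split the batch axis again and move T and C back to the front.
x_back = y.reshape(N, H, W, C, T).transpose(0, 4, 3, 1, 2)

print(y.shape)                    # (98, 16, 8)
print(np.array_equal(x_back, x))  # True
```

After the reshape, each row `y[i]` holds the full time series of all channels at one spatial position of one sample, which is the layout a 1D temporal convolution expects.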

#III. Plan for this week

I may need to talk with my advisor to settle the research plan for the next stage.
