7 Kinds of Visual MLPs (Part 2)
2022-07-19 05:47:00 【byzy】
1. RepMLP
Original paper: https://arxiv.org/pdf/2105.01883.pdf
RepMLP (re-parameterized MLP) starts from the observation that, compared with convolution, FC layers are not good at capturing local information. Its training-time and inference-time structures are different.
Training phase: the model is composed of a global perceptron, a partition perceptron, and a local perceptron.
Global perceptron

The feature map is divided into partitions. To capture the interaction between partitions, each partition is processed by average pooling, fed through BN and a 2-layer MLP, then reshaped and added onto the partition map, as in the sketch below.
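A minimal sketch of the global-perceptron idea, assuming PyTorch; the hidden width and the activation between the two FC layers are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn as nn

class GlobalPerceptron(nn.Module):
    """Sketch: average-pool each partition, run BN + a 2-layer MLP over the
    pooled values, then broadcast-add the result back onto the partition map."""
    def __init__(self, channels, num_partitions, hidden=64):  # hidden width: illustrative
        super().__init__()
        self.bn = nn.BatchNorm1d(channels * num_partitions)
        self.mlp = nn.Sequential(
            nn.Linear(channels * num_partitions, hidden),
            nn.ReLU(),                                         # activation: assumption
            nn.Linear(hidden, channels * num_partitions),
        )

    def forward(self, partitions):
        # partitions: (N, num_partitions, C, h, w)
        n, p, c, h, w = partitions.shape
        pooled = partitions.mean(dim=(3, 4)).reshape(n, p * c)  # average pooling per partition
        out = self.mlp(self.bn(pooled)).reshape(n, p, c, 1, 1)  # BN + 2-layer MLP, then reshape
        return partitions + out                                  # add onto the partition map
```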
Partition perceptron
It consists of an FC layer and a BN layer, taking the partition map as input. The FC layer is a group FC (analogous to group convolution) to reduce the number of parameters.
A group FC can be implemented with a grouped 1×1 convolution as follows: (1) reshape the input into a feature map whose spatial size is 1×1; (2) process it with a 1×1 convolution with g groups; (3) reshape the resulting feature map back into a matrix. That is,

gMMUL(M, W, g) = RS(gCONV(RS(M, (N, P, 1, 1)), F, g), (N, Q)),

where RS denotes reshape and gMMUL denotes the group FC (group matrix multiplication).
Here W is the weight matrix of the group FC (its size should be Q × (P/g)), F is the converted group-convolution kernel (Q kernels, each of size (P/g) × 1 × 1), and P and Q are the input and output dimensions of the FC layer, respectively. For the partition perceptron, P = C·h·w and Q = O·h·w, and C and O should both be divisible by g.
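A small numerical check of this equivalence, assuming PyTorch: the group FC is computed once as a grouped 1×1 convolution (steps 1–3 above) and once as a block-wise matrix multiplication. The values of N, P, Q and g below are arbitrary illustrative choices.

```python
import torch
import torch.nn.functional as F

N, P, Q, g = 2, 12, 8, 4          # batch, FC input dim, FC output dim, groups (illustrative)
x = torch.randn(N, P)

# Grouped 1x1 conv kernel: Q output channels, each sees P/g input channels.
weight = torch.randn(Q, P // g, 1, 1)

# (1) reshape the input into a feature map with 1x1 spatial size
x_map = x.reshape(N, P, 1, 1)
# (2) grouped 1x1 convolution
y_map = F.conv2d(x_map, weight, groups=g)
# (3) reshape back to (N, Q)
y_conv = y_map.reshape(N, Q)

# Reference: the same group FC as a block-diagonal matrix multiplication.
blocks = weight.reshape(g, Q // g, P // g)            # per-group weight blocks
y_fc = torch.cat(
    [x[:, i * (P // g):(i + 1) * (P // g)] @ blocks[i].t() for i in range(g)],
    dim=1,
)
print(torch.allclose(y_conv, y_fc, atol=1e-6))        # True
```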
Local perceptron
The partition map is passed through several parallel convolution layers (padded so the resolution matches the input, each followed by BN); the number of convolution groups should be the same as in the partition perceptron (see the sketch below).
Finally, all convolution outputs and the output of the partition perceptron are added together, the shape is restored, and the final output is obtained.
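A hedged sketch of the local perceptron, assuming PyTorch; the kernel sizes of the parallel branches are illustrative. The point is only that each branch preserves resolution via padding, uses the same number of groups as the partition perceptron, and is followed by BN.

```python
import torch
import torch.nn as nn

class LocalPerceptron(nn.Module):
    """Parallel conv branches over the partition map; their outputs are summed
    (and, in RepMLP, added to the partition-perceptron output)."""
    def __init__(self, channels, groups, kernel_sizes=(1, 3, 5)):  # kernel sizes: illustrative
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2,
                          groups=groups, bias=False),   # same groups as the partition perceptron
                nn.BatchNorm2d(channels),
            )
            for k in kernel_sizes
        )

    def forward(self, partition_map):
        # partition_map: (N * num_partitions, C, h, w)
        return sum(branch(partition_map) for branch in self.branches)
```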
Inference stage: RepMLP is converted into 3 FC layers.
The key lies in two steps:
1. Merge BN into the preceding convolution. With BN statistics μ, σ² and affine parameters γ, β, the folded kernel and bias are F' = (γ/σ)·F and b' = β − μ·γ/σ.
2. Convert the convolution into an FC layer. Let I be the identity matrix of dimension C·h·w, reshaped into C·h·w inputs of shape (C, h, w); pushing it through the convolution gives the equivalent FC weight:

W(F, p) = RS(CONV(RS(I, (Chw, C, h, w)), F, p), (Chw, Ohw))ᵀ,

where p is the padding, F is the convolution kernel, and W(F, p) is the weight of the equivalent FC layer.
In this way, FC3 can be merged with the convolutions of the local perceptron.
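A minimal sketch of the two re-parameterization steps, assuming PyTorch: folding BN into the preceding convolution, and building the FC weight equivalent to a padded convolution by pushing a reshaped identity matrix through it. The concrete sizes in the check are illustrative.

```python
import torch
import torch.nn.functional as F

def fuse_conv_bn(conv_w, bn_mean, bn_var, bn_gamma, bn_beta, eps=1e-5):
    """Fold BN statistics into the convolution kernel and bias."""
    std = (bn_var + eps).sqrt()
    fused_w = conv_w * (bn_gamma / std).reshape(-1, 1, 1, 1)
    fused_b = bn_beta - bn_mean * bn_gamma / std
    return fused_w, fused_b

def conv_to_fc(conv_w, conv_b, C, h, w, padding):
    """Build the FC weight equivalent to a conv on (C, h, w) inputs by
    convolving a reshaped identity matrix (the identity trick above)."""
    O = conv_w.shape[0]
    I = torch.eye(C * h * w).reshape(C * h * w, C, h, w)   # Chw one-hot "images"
    out = F.conv2d(I, conv_w, padding=padding)             # (Chw, O, h, w)
    fc_w = out.reshape(C * h * w, O * h * w).t()           # (Ohw, Chw)
    fc_b = conv_b.repeat_interleave(h * w)                 # bias repeated over spatial positions
    return fc_w, fc_b

# Numerical check with illustrative sizes
C, O, h, w, k = 3, 4, 5, 5, 3
x = torch.randn(2, C, h, w)
conv_w, conv_b = torch.randn(O, C, k, k), torch.randn(O)

y_conv = F.conv2d(x, conv_w, conv_b, padding=k // 2).reshape(2, -1)
fc_w, fc_b = conv_to_fc(conv_w, conv_b, C, h, w, padding=k // 2)
y_fc = x.reshape(2, -1) @ fc_w.t() + fc_b
print(torch.allclose(y_conv, y_fc, atol=1e-4))             # True
```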
2. ResMLP
Original paper: https://arxiv.org/pdf/2105.03404.pdf
"Res" stands for residual.
Model structure
First, the original image is divided into patches, each of which is linearly embedded into a d-dimensional token and fed into ResMLP. In the figure, A denotes the per-column affine transformation and T denotes transposition.
Residual Multi-Perceptron layer
Each block is a linear (cross-patch) layer plus a feed-forward layer. LN is not used; instead an affine transformation is applied to each column:

Aff(x) = Diag(α)·x + β,

where α and β are learnable vectors. This transformation is applied twice in each residual block (the two instances are called pre and post), and at inference time they are folded into the adjacent linear layer.
The feed-forward network is the same as in a Transformer: a two-layer MLP, with the activation function replaced by GELU.
The block can be written as

Z = X + Aff((A·Aff(X)ᵀ)ᵀ)
Y = Z + Aff(C·GELU(B·Aff(Z)))

where A is the weight of the linear (cross-patch) layer, of dimension N×N (N is the number of patches), B is of dimension 4d×d, and C is of dimension d×4d.
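A minimal sketch of one ResMLP residual block matching the formulas above, assuming PyTorch; initialization details are omitted.

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Aff(x) = alpha * x + beta, applied per channel (i.e. per column)."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                 # x: (batch, N patches, d)
        return self.alpha * x + self.beta

class ResMLPBlock(nn.Module):
    def __init__(self, num_patches, dim):
        super().__init__()
        self.aff1, self.aff2 = Affine(dim), Affine(dim)   # pre / post of the linear sub-block
        self.aff3, self.aff4 = Affine(dim), Affine(dim)   # pre / post of the FFN sub-block
        self.cross_patch = nn.Linear(num_patches, num_patches)   # A: N x N
        self.ffn = nn.Sequential(                                 # B: 4d x d, C: d x 4d
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):                 # x: (batch, N, d)
        # Z = X + Aff((A Aff(X)^T)^T): mix information across patches
        z = x + self.aff2(self.cross_patch(self.aff1(x).transpose(1, 2)).transpose(1, 2))
        # Y = Z + Aff(C GELU(B Aff(Z))): per-patch feed-forward
        return z + self.aff4(self.ffn(self.aff3(z)))
```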
3. S²-MLPv2
Original paper: https://arxiv.org/pdf/2108.01072.pdf
S²-MLP
patch embedding layer + several S²-MLP blocks + classification head
The patch embedding layer divides the image into patches, and each patch is mapped through an FC layer to a c-dimensional vector.
S²-MLP block
It consists of four MLPs acting on the channel dimension plus a spatial-shift layer.
Spatial shift: the feature map is split into 4 groups along the channel dimension, and the groups are shifted by 1 unit along the positive and negative directions of width and height, respectively (see the sketch below).
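A minimal sketch of the spatial-shift operation, assuming PyTorch and a (batch, H, W, C) layout; the assignment of channel groups to shift directions follows one common implementation, and border positions simply keep their original values.

```python
import torch

def spatial_shift(x):
    """x: (batch, H, W, C). Shift each quarter of the channels by one position
    along +W, -W, +H, -H respectively; positions with no source keep their values."""
    b, h, w, c = x.shape
    out = x.clone()
    out[:, :, 1:, :c // 4]           = x[:, :, :w - 1, :c // 4]               # shift right (+W)
    out[:, :, :w - 1, c // 4:c // 2] = x[:, :, 1:, c // 4:c // 2]             # shift left  (-W)
    out[:, 1:, :, c // 2:3 * c // 4] = x[:, :h - 1, :, c // 2:3 * c // 4]     # shift down  (+H)
    out[:, :h - 1, :, 3 * c // 4:]   = x[:, 1:, :, 3 * c // 4:]               # shift up    (-H)
    return out

# Usage: y = spatial_shift(torch.randn(2, 14, 14, 64))
```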
Split attention
Let X_k (k = 1, …, K) be feature maps of size n×c, where n is the number of patches and c is the number of channels. Summing along the spatial dimension gives a c-dimensional vector a:

a = Σ_k 1·X_k,

where 1 is the all-ones row vector of length n. An MLP then maps a to a Kc-dimensional vector â = W₂σ(W₁a) (σ is GELU), which is reshaped into a K×c matrix Â. A softmax along the first dimension yields Ā, and the new feature map is

Y = Σ_k Ā[k, :] ⊙ X_k,

where ⊙ denotes element-wise multiplication (broadcast over the n rows of X_k).
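A hedged sketch of split attention as reconstructed above, assuming PyTorch; the hidden width of the MLP is an illustrative choice.

```python
import torch
import torch.nn as nn

class SplitAttention(nn.Module):
    def __init__(self, channels, k=3, hidden=None):
        super().__init__()
        hidden = hidden or channels            # hidden width: assumption
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.GELU(), nn.Linear(hidden, k * channels)
        )

    def forward(self, xs):
        # xs: (batch, K, n, c) -- K feature maps, n patches, c channels
        b, k, n, c = xs.shape
        a = xs.sum(dim=(1, 2))                        # sum over branches and patches -> (b, c)
        a_hat = self.mlp(a).reshape(b, k, c)          # Kc logits reshaped to (b, K, c)
        a_bar = a_hat.softmax(dim=1)                  # softmax over the K branches
        return (a_bar.unsqueeze(2) * xs).sum(dim=1)   # weighted sum -> (b, n, c)
```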
S²-MLPv2: patch embedding layer + several S²-MLPv2 blocks + classification head.
An S²-MLPv2 block contains an S²-MLPv2 component and a channel-mixing MLP (CM-MLP). The CM-MLP has the same structure as in MLP-Mixer (see Section 2 of "7 Kinds of Visual MLPs (Part 1)").
S²-MLPv2 Block structure
The channel dimension of the input X (of size w×h×c) is first expanded to 3c. The result is split into 3 feature maps, each of size w×h×c; two of them are spatially shifted as shown in the figure, and the third remains unchanged. The 3 feature maps are then reshaped into n×c matrices, fused by split attention, and finally passed through an MLP.
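Putting the pieces together, a hedged sketch of the S²-MLPv2 component, assuming PyTorch and reusing spatial_shift and SplitAttention from the sketches above; the way the second branch is shifted in the opposite directions (via flipping) is an illustrative assumption, not necessarily the paper's exact shift pattern.

```python
import torch
import torch.nn as nn

class S2MLPv2Component(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.expand = nn.Linear(channels, 3 * channels)    # c -> 3c
        self.split_attention = SplitAttention(channels, k=3)
        self.project = nn.Linear(channels, channels)       # final MLP after fusion

    def forward(self, x):
        # x: (batch, H, W, c)
        b, h, w, c = x.shape
        x = self.expand(x)                                  # (b, H, W, 3c)
        x1, x2, x3 = x.chunk(3, dim=-1)
        x1 = spatial_shift(x1)                              # shifted branch
        x2 = spatial_shift(x2.flip(dims=(1, 2))).flip(dims=(1, 2))  # opposite shifts (assumption)
        xs = torch.stack(
            [x1.reshape(b, h * w, c), x2.reshape(b, h * w, c), x3.reshape(b, h * w, c)],
            dim=1,
        )                                                   # (b, 3, n, c)
        y = self.split_attention(xs)                        # (b, n, c)
        return self.project(y).reshape(b, h, w, c)
```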