EfficientNet Series (1): EfficientNetV2 Network Details
2022-07-19 03:50:00 【@BangBang】
EfficientNet Network Profile
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks is a paper published by Google in 2019. In it, the authors study how input resolution, network depth, and network width affect accuracy. Earlier work typically increased the image resolution, the network depth, or the network width separately to improve accuracy. In the EfficientNet paper, the authors instead use network architecture search (NAS) to explore the effect of scaling input resolution, network depth, and width jointly.
How well does EfficientNet perform?
The figure here, from the original paper, compares the Top-1 accuracy of EfficientNet with a series of mainstream classification networks of the time. We can see that EfficientNet not only has fewer parameters than many mainstream models, but its accuracy is also clearly better.
- The paper mentions that the proposed EfficientNet-B7 reached 84.3% top-1 accuracy on ImageNet, the highest of that year. Compared with GPipe, which held the highest accuracy before, it has only 1/8.4 as many parameters, and its inference speed is 6.1 times faster.
Network comparison (width, depth, resolution)
- Figure (a) shows a conventional convolutional neural network.
- Figure (b) increases only the width of the network in (a) (width means the number of channels of each feature layer).
- Figure (c) increases only the depth of the network in (a); it clearly has more layers than (a), so the network becomes deeper.
- Figure (d) increases the input resolution of the baseline network in (a); raising the image resolution increases the height and width of every feature matrix we obtain accordingly.
- Figure (e) increases the network width, the depth, and the resolution of the input image simultaneously.
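To make the resolution point in figure (d) concrete, here is a minimal PyTorch sketch (the layer and the input sizes are illustrative, not taken from the paper): the same convolution produces a feature map whose height and width grow with the input resolution.

```python
import torch
import torch.nn as nn

# A single 3x3 conv with stride 2, as in a typical stem layer.
conv = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)

for size in (224, 380):  # e.g. a B0-like vs. a B4-like input resolution
    x = torch.randn(1, 3, size, size)
    y = conv(x)
    # The output feature map's height/width grow with the input resolution.
    print(size, "->", tuple(y.shape))  # (1, 32, 112, 112) then (1, 32, 190, 190)
```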
- Based on past experience, increasing the depth of a network yields richer and more complex features that transfer well to other tasks. But if the network is too deep, gradients vanish and training becomes difficult.
- Increasing the width of a network captures finer-grained features and is easier to train, but wide, shallow networks often have difficulty learning higher-level features.
- Increasing the resolution of the input image can potentially yield finer-grained feature maps, but at very high input resolutions the accuracy gain diminishes, and large-resolution images also increase the amount of computation.
As the figure above shows, the scale-by-width, scale-by-depth, and scale-by-resolution curves (the three dotted lines) all saturate once accuracy reaches roughly 80% and stop improving. The red line, for which the network width, depth, and resolution are increased together, does not saturate at 80% accuracy and keeps improving. This shows that scaling the network's depth, width, and resolution at the same time gives better results. Moreover, at the same theoretical amount of computation, increasing depth, width, and resolution together also performs better.
EfficientNet-B0 Network
The EfficientNet-B0 network itself was also obtained by the authors through network architecture search; its detailed network parameters are listed in the following table.
- Looking at the table, EfficientNet-B0 consists of 9 stages in total. Stage 1 is a `3x3` convolutional layer. For `stage2~stage8` we can see that they repeatedly stack `MBConv` blocks, where `MBConv` is the MobileNet conv block, which will be discussed below. Stage 9 consists of 3 parts: a `Conv 1x1`, `Pooling`, and an `FC` layer.
- The resolution (`Resolution`) here is the height and width of the input feature matrix of each `Stage`. `Channels` is the number of channels of each `Stage`'s output feature matrix. `Layers` is how many times the corresponding `Operator` is repeated; for example, `stage3` has `Layers` = 2, so its `MBConv6` is repeated twice.
- The `stride` listed applies only to the first layer of each `Stage`; the strides of all the remaining layers are 1.
EfficientNet-B0 Network
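As a quick reference, here is the stage configuration from the table above written out as plain Python data (a sketch for readability; the field layout is my own, while the numbers follow the paper's Table 1):

```python
# (operator, kernel, stride, out_channels, layers) for EfficientNet-B0, stages 2-8.
# "MBConv1"/"MBConv6" encode the expansion factor n of the first 1x1 conv.
B0_STAGES = [
    ("MBConv1", 3, 1,  16, 1),  # stage 2
    ("MBConv6", 3, 2,  24, 2),  # stage 3
    ("MBConv6", 5, 2,  40, 2),  # stage 4
    ("MBConv6", 3, 2,  80, 3),  # stage 5
    ("MBConv6", 5, 1, 112, 3),  # stage 6
    ("MBConv6", 5, 2, 192, 4),  # stage 7
    ("MBConv6", 3, 1, 320, 1),  # stage 8
]
```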
MBConv module
The paper notes that MBConv is in fact the same block used in MobileNetV3. Let's take a brief look at the structure of the MBConv block used in EfficientNet.
- First, on the main branch there is a `1x1` convolution that is generally used to expand the channel dimension, followed by BN and the `Swish` activation function.
- Next comes a `DW` (depthwise) convolution whose kernel is `k x k`, where `k` may be `3` or `5`, and whose stride may be `1` or `2`.
- The output of the `DW` convolution then passes through `BN` and the `Swish` activation function, and then through an `SE` module.
- Next comes a `1x1` convolution; this `1x1` convolution serves to reduce the channel dimension. Note that it is followed only by `BN`, with no `Swish` activation function.
- This is followed by a `dropout` operation.
- Finally, the input feature matrix is carried over via the `shortcut` branch and added directly to the output feature matrix of the main branch to produce the corresponding output.
Here are a few points to note:
- For the first `1x1` expansion convolution, the number of convolution kernels is `n` times the `channel` count of the input feature matrix, where `n` is the expansion factor that appears in the `Operator` name; `MBConv6`, for example, corresponds to `n` = 6.
- For the last `1x1` reduction convolution in MBConv, the number of convolution kernels is set by the `Channels` column of the table above: the number of `1x1` kernels equals that `Channels` value.
- The second point to note concerns MBConv1, i.e. the case `n` = 1. Here the first `1x1` convolution is omitted: since that convolution exists mainly to expand the channel dimension, `n` = 1 means no expansion is needed. This corresponds to `Stage2` in the table, whose `operator` is `MBConv1`; its MBConv block therefore has no leading `1x1` convolution.
- The `shortcut` connection exists only when the feature matrix input to the `MBConv` structure and the output feature matrix have the same shape (see the PyTorch sketch after the SE module section below).
SE module
- First, a global average pooling operation is applied to each `channel` of the input `feature map`, followed by two fully connected layers.
- Note that the activation function of the first fully connected layer is the `Swish` activation function, while the activation function of the second fully connected layer is the `Sigmoid` activation function.
- The number of nodes in the first fully connected layer is 1/4 of the `channels` of the feature matrix input to the `MBConv` block, while the number of nodes in the second fully connected layer equals the `channels` of the `feature_map`, where `feature_map` is the output feature matrix of the `DW` convolution inside `MBConv`.
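Putting the MBConv structure and the SE module together, here is a minimal PyTorch sketch. It follows the description above; the class and argument names are my own, and details such as the dropout (which stands in for the stochastic-depth drop used in the original implementation) are simplified, so treat it as an illustration rather than the official code:

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-Excitation: global pool -> FC (Swish) -> FC (Sigmoid) -> scale."""
    def __init__(self, dw_channels: int, block_in_channels: int):
        super().__init__()
        squeeze = block_in_channels // 4  # 1/4 of the MBConv *input* channels
        # 1x1 convolutions act as the two fully connected layers.
        self.fc1 = nn.Conv2d(dw_channels, squeeze, kernel_size=1)
        self.act1 = nn.SiLU()   # SiLU is the same function as Swish
        self.fc2 = nn.Conv2d(squeeze, dw_channels, kernel_size=1)
        self.act2 = nn.Sigmoid()

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)  # global average pool per channel
        s = self.act1(self.fc1(s))
        s = self.act2(self.fc2(s))
        return x * s                           # channel-wise re-weighting

class MBConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel: int, stride: int,
                 expand: int, drop_rate: float = 0.0):
        super().__init__()
        mid = in_ch * expand
        layers = []
        if expand != 1:  # MBConv1 has no 1x1 expansion convolution
            layers += [nn.Conv2d(in_ch, mid, 1, bias=False),
                       nn.BatchNorm2d(mid), nn.SiLU()]
        layers += [
            # depthwise k x k convolution, stride 1 or 2
            nn.Conv2d(mid, mid, kernel, stride, padding=kernel // 2,
                      groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU(),
            SEModule(mid, in_ch),
            # 1x1 reduction convolution: BN only, no activation
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        ]
        self.block = nn.Sequential(*layers)
        # Shortcut only when input and output feature matrices match in shape.
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        # Simplified stand-in for the stochastic-depth drop of the paper.
        self.dropout = nn.Dropout2d(drop_rate)

    def forward(self, x):
        out = self.block(x)
        if self.use_shortcut:
            out = self.dropout(out) + x
        return out

# e.g. one stage-3 block of B0: MBConv6, k3x3, stride 2, 16 -> 24 channels
blk = MBConv(16, 24, kernel=3, stride=2, expand=6)
print(blk(torch.randn(1, 16, 112, 112)).shape)  # torch.Size([1, 24, 56, 56])
```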
EfficientNet-B0 ~ EfficientNet-B7 network parameters

The networks EfficientNet-B0 through EfficientNet-B7 all share the same structure; only parameter settings such as `input_size`, `width_coefficient`, and `depth_coefficient` differ.

- `width_coefficient` is the multiplier factor on the `channel` dimension. For example, the 3x3 convolution layer of Stage 1 in EfficientNet-B0 uses 32 convolution kernels, so in B6 this becomes 32 x 1.8 = 57.6, which is then rounded to the nearest multiple of 8, giving 56; the other stages are handled the same way.
- `depth_coefficient` is the multiplier factor on the `depth` dimension (it applies only to `Stage2` through `Stage8`). For example, Stage 7 of EfficientNet-B0 has L = 4, so in B6 this becomes 4 x 2.6 = 10.4, which is rounded up to 11. A small sketch of this rounding logic follows at the end of this section.
- `drop_connect_rate` is the random drop ratio of the `dropout` layers inside MBConv. Note that the drop ratio is not 0.2 for every MBConv layer: in the source implementation, the drop ratio of the `dropout` layers in the MBConv structures grows slowly from 0 up to the given `drop_connect_rate`.
- The last parameter, `dropout_rate`, is the drop ratio of the `dropout` layer before the final fully connected layer of `EfficientNet`.
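The rounding described above can be sketched as follows (a minimal reimplementation of the idea; the names mirror the `round_filters` / `round_repeats` helpers found in common EfficientNet implementations, but the code here is my own):

```python
import math

def round_filters(filters: int, width_coefficient: float, divisor: int = 8) -> int:
    """Scale a channel count and round to the nearest multiple of `divisor`."""
    filters *= width_coefficient
    new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
    if new_filters < 0.9 * filters:  # don't round down by more than 10%
        new_filters += divisor
    return int(new_filters)

def round_repeats(repeats: int, depth_coefficient: float) -> int:
    """Scale a layer count and round up."""
    return int(math.ceil(depth_coefficient * repeats))

# The two examples from the text (B6: width 1.8, depth 2.6):
print(round_filters(32, 1.8))  # 32 * 1.8 = 57.6 -> 56
print(round_repeats(4, 2.6))   # 4 * 2.6 = 10.4 -> 11
```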
Performance comparison
- Compared with ResNet-50 and DenseNet-169, EfficientNet-B0 achieves the highest accuracy with the fewest parameters and the lowest theoretical amount of computation. The B1 ~ B7 series of networks compare similarly.
- In practice, its accuracy really is high and its parameter count really is small; there is no doubt about that. But there is one problem when training it: it occupies a lot of GPU memory. In models such as B4, B5, B6, and B7 of the EfficientNet family, the resolution of the input image is very large, so the height and width of the output feature matrix of every layer grow correspondingly, and the GPU memory usage grows with them.
- Also, comparing speed directly via `Flops` is not quite right. In reality, the speed we care about is the inference speed on the device; real inference speed is not directly tied to `Flops` and is influenced by many other factors. So it would be more meaningful to report the inference time on some actual devices; a small timing sketch follows below.
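For completeness, here is one common way to time inference on a device in PyTorch (a generic sketch; it assumes a recent torchvision that provides `efficientnet_b0`, and the batch size and input resolution are placeholders):

```python
import time
import torch
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.efficientnet_b0().to(device).eval()
x = torch.randn(1, 3, 224, 224, device=device)

with torch.no_grad():
    for _ in range(10):               # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()      # wait for queued GPU work to finish
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"mean latency: {elapsed / 100 * 1000:.2f} ms")
```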