EfficientNetV2: Smaller Models and Faster Training

Balin

9 min read · May 26, 2021

Introduction

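Gradually increasing the image size during training makes training faster, but it usually also lowers accuracy. This paper addresses that with progressive learning: as the image size grows, the regularization (dropout, data augmentation) is adjusted to match.
Papers on this topic focus on being faster and more accurate than prior work, in both training and testing. There are many ways to get there, such as removing batch normalization, tuning hyperparameters, or using attention or Transformers, but these approaches usually come with a much larger parameter count.
This paper uses training-aware neural architecture search (NAS) and scaling to reduce both training time and the number of parameters. In the results, "21k" refers to ImageNet21k, which is roughly 10 times larger than ImageNet ILSVRC2012.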

The authors make the following observations:

(1) training with very large image sizes is slow.
(2) depthwise convolutions are slow in early layers.
(3) equally scaling up every stage is sub-optimal.

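Images of different sizes should get different amounts of regularization. For the same network, a smaller image size means a smaller effective network capacity, so weaker regularization is enough. Larger images need stronger regularization to deal with overfitting. This is why the paper uses progressive learning: in the early epochs the network is trained with small images and weak regularization (e.g., dropout and data augmentation), and then the image size and the regularization strength are increased gradually. Adjusting these settings dynamically makes training faster without hurting accuracy.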

Problems

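Earlier NAS work mostly focused on network architectures for specific tasks, hyperparameters, FLOPs, and inference efficiency. This paper instead uses NAS to optimize training efficiency and parameter efficiency.
The goal and scope are the same as EfficientNetV1, with improvements in the following areas.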

1. Training with very large image sizes is slow

V1 needs a large amount of memory when training on large images, even though training with smaller images sometimes gives higher accuracy.

2. Depthwise convolutions are slow in early layers

Depthwise convolutions have fewer parameters and FLOPs than regular convolutions, but they often cannot fully utilize modern accelerators.
To make better use of mobile or server accelerators, some papers replace the depthwise conv3x3 and the conv1x1 in the figure below with a single regular conv3x3 (right).

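The authors therefore try the same change on V1, replacing the original MBConv with Fused-MBConv; the figure below compares the two. In short, Fused-MBConv helps in the early layers, but elsewhere it increases the parameter count and the amount of computation and can even lower accuracy, so the two block types have to be balanced.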
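To make the trade-off concrete, here is a rough weight count for the two block types. It is a simplified sketch, not the paper's accounting: it ignores biases, batch normalization, and squeeze-and-excitation, and the function names and channel sizes are made up for illustration.

```python
def mbconv_params(c_in, c_out, expand, k=3):
    """Approximate weight count of an MBConv block (no bias, BN, or SE)."""
    c_mid = c_in * expand
    expand_1x1 = c_in * c_mid      # pointwise expansion
    depthwise = k * k * c_mid      # one k x k filter per expanded channel
    project_1x1 = c_mid * c_out    # pointwise projection
    return expand_1x1 + depthwise + project_1x1


def fused_mbconv_params(c_in, c_out, expand, k=3):
    """Approximate weight count of a Fused-MBConv block (no bias, BN, or SE)."""
    c_mid = c_in * expand
    fused_kxk = k * k * c_in * c_mid   # regular conv replaces expansion + depthwise
    project_1x1 = c_mid * c_out        # pointwise projection
    return fused_kxk + project_1x1


# Hypothetical channel counts: a narrow early stage and a wide late stage.
for channels in (24, 256):
    plain = mbconv_params(channels, channels, expand=4)
    fused = fused_mbconv_params(channels, channels, expand=4)
    print(channels, plain, fused, fused - plain)
```

Under these assumptions the fused block always has more weights, but the absolute overhead is small while channels are narrow and grows quickly with width. That is consistent with using Fused-MBConv only in the early stages, where the regular convolution also maps better onto accelerators.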

3. Equally scaling up every stage is sub-optimal

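V1 scales depth, width, and resolution by fixed ratios, but a fixed ratio is not optimal, so this paper adjusts each part separately. In addition, V1 keeps increasing the image size as the model is scaled up, which leads to large memory consumption and slow training. V2 therefore changes the earlier scaling rule and caps the maximum image size.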
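As a reminder of what "fixed ratios" means, the sketch below applies V1-style compound scaling with an optional resolution cap. The coefficients 1.2, 1.1, and 1.15 are the ones I recall from the V1 paper and should be treated as assumptions; the base resolution and the function name are illustrative only, and the cap of 480 is the value quoted later in this post.

```python
def compound_scale(phi, base_resolution=224, alpha=1.2, beta=1.1, gamma=1.15,
                   max_resolution=None):
    """Return (depth multiplier, width multiplier, image size) for a scaling step phi.

    alpha, beta, gamma are assumed V1-style coefficients, not values from this post.
    """
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    resolution = round(base_resolution * gamma ** phi)
    if max_resolution is not None:
        # V2-style restriction: stop growing the image beyond a fixed size.
        resolution = min(resolution, max_resolution)
    return depth_mult, width_mult, resolution


for phi in range(0, 8):
    uncapped = compound_scale(phi)
    capped = compound_scale(phi, max_resolution=480)
    print(phi, uncapped[2], capped[2])
```

With uniform scaling the image size keeps growing with every step, which is where the memory and training-time cost comes from. Capping the resolution leaves depth and width as the remaining ways to add capacity.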

Method

Training-Aware NAS and Scaling

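The best combination is found by actually training candidate networks. The baseline is V1, whose channel sizes have already been searched, and earlier work did not optimize pooling or skip ops either, so these are left out here as well, which shrinks the search space. Candidates are then sampled at random, trained, and compared to see which is best. The search reward combines A = accuracy, S = normalized training step time, and P = parameter size.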
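As far as I recall, the paper combines the three terms as a weighted product, A · S^w · P^v, with negative exponents so that slower or larger models score lower. The sketch below assumes that form; the exponents w = -0.07 and v = -0.05 are quoted from memory and should be checked against the paper, and the candidate numbers are invented.

```python
def search_reward(accuracy, step_time, params, w=-0.07, v=-0.05):
    """Weighted-product reward A * S^w * P^v (assumed form and exponents).

    step_time and params are expected to be normalized, positive values.
    """
    return accuracy * (step_time ** w) * (params ** v)


# Two hypothetical candidates with the same accuracy: the second trains
# twice as slowly per step and has twice as many parameters.
fast_small = search_reward(accuracy=0.80, step_time=1.0, params=1.0)
slow_large = search_reward(accuracy=0.80, step_time=2.0, params=2.0)
print(fast_small, slow_large)
```

Because the exponents are small, accuracy still dominates the reward, while training step time and parameter size act as mild penalties.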

The architecture search space includes:
1. convolutional operation types {MBConv, Fused-MBConv}.
2. number of layers, kernel size {3x3, 5x5}.
3. expansion ratio {1, 4, 6}.
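The search space above is small enough to sample from directly. The following is only a sketch of per-stage random sampling: the number of stages, the upper bound on layers, and all names are hypothetical and not taken from the paper.

```python
import random

# Per-stage choices listed above.
SEARCH_SPACE = {
    "op": ["MBConv", "Fused-MBConv"],
    "kernel_size": [3, 5],
    "expand_ratio": [1, 4, 6],
}


def sample_architecture(num_stages=6, max_layers=8, rng=None):
    """Draw one random candidate: an independent choice for every stage.

    num_stages and max_layers are illustrative assumptions.
    """
    rng = rng or random.Random()
    stages = []
    for _ in range(num_stages):
        stage = {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}
        stage["num_layers"] = rng.randint(1, max_layers)
        stages.append(stage)
    return stages


candidate = sample_architecture(rng=random.Random(0))
for index, stage in enumerate(candidate):
    print(index, stage)
```

Each sampled candidate would then be trained briefly and scored, and the best-scoring ones kept.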

EfficientNetV2 Architecture

Differences from V1:
1. Uses both MBConv and Fused-MBConv
2. Uses a smaller expansion ratio for MBConv to reduce memory usage
3. Uses 3x3 kernel sizes but adds more layers to maintain the receptive field
4. Removes the last stride-1 stage to reduce parameter size and memory access
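The third point relies on the fact that stacked small kernels cover the same input region as one larger kernel. A minimal sketch, assuming stride-1 convolutions without dilation (the function name is made up):

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1, undilated convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1   # each layer widens the field by (k - 1) pixels
    return field


print(receptive_field([5]))      # one 5x5 layer
print(receptive_field([3, 3]))   # two stacked 3x3 layers
```

Under these assumptions two 3x3 layers see the same 5x5 window as a single 5x5 layer, which is why dropping the larger kernels has to be compensated with extra layers.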

EfficientNetV2 Scaling

We scale up EfficientNetV2-S to obtain EfficientNetV2-M/L using similar compound scaling as (Tan & Le, 2019a), with a few additional optimizations:

(1) we restrict the maximum inference image size to 480, as very large images often lead to expensive memory and training speed overhead;

(2) as a heuristic, we also gradually add more layers to later stages (e.g., stage 5 and 6 in Table 4) in order to increase the network capacity without adding much runtime overhead

Training Speed Comparison

All models are trained with a fixed image size, without progressive learning.
EffNet (reprod) is trained with an image size about 30% smaller.

Progressive Learning

Progressive Learning with adaptive Regularization

The figure below illustrates the idea of progressive learning.

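Training moves from image size S_0 and regularization Φ_0 to image size S_e and regularization Φ_e, split into several stages by interpolation (4 stages in this paper). The algorithm is given below, and the weights from each stage are carried over to the next stage.
Only three kinds of regularization are adjusted: Dropout, RandAugment, and Mixup.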
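The interpolation between (S_0, Φ_0) and (S_e, Φ_e) can be sketched as follows. This assumes plain linear interpolation over the stage index; the function name and the start and end values are placeholders for illustration and are not taken from this post.

```python
def progressive_schedule(s0, se, phi0, phie, num_stages=4):
    """Linearly interpolate image size and regularization across stages.

    phi0 and phie map a regularization name to its start and end strength.
    """
    schedule = []
    for i in range(num_stages):
        t = i / (num_stages - 1) if num_stages > 1 else 1.0
        image_size = int(round(s0 + (se - s0) * t))
        regularization = {
            name: phi0[name] + (phie[name] - phi0[name]) * t for name in phi0
        }
        schedule.append((image_size, regularization))
    return schedule


# Placeholder settings: small images with weak regularization at the start,
# larger images with stronger regularization at the end.
start = {"dropout": 0.1, "randaugment_magnitude": 5.0, "mixup_alpha": 0.0}
end = {"dropout": 0.3, "randaugment_magnitude": 15.0, "mixup_alpha": 0.2}

for stage, (size, reg) in enumerate(progressive_schedule(128, 300, start, end)):
    # A real training loop would train for some epochs here with these
    # settings, keeping the model weights from the previous stage.
    print(stage, size, reg)
```

The model itself is never reinitialized between stages; only the input size and the regularization strengths change.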

These are the stage settings for the different V2 models: the minimum values are the same for all of them, but the maximum values differ.

Experiments

The main results are on ImageNet, plus transfer learning on CIFAR-10, CIFAR-100, Cars, and Flowers.

ImageNet

The results table is large, so it is captured in two screenshots. "21k" means pretrained on ImageNet21k and then finetuned on ImageNet ILSVRC2012.

The plots show that V2 is both faster and cheaper.

Transfer Learning Datasets

Ablation Studies

Performance with the same training

Some comparisons with V1, including applying progressive learning to V1 as well.

Scaling Down

A scaled-down V2-S is compared with V1, without using progressive learning.

Progressive Learning for Different Networks

Importance of Adaptive Regularization

Comparing random resize with progressive resize supports the paper's claim that the method "uses much smaller regularization for small images at the early training epochs, allowing models to converge faster and achieve better final accuracy."

Reference

[arxiv]
