Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length

7 min read · Jul 12, 2021

Introduction

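This post is follow-up reading to T2T-ViT; judging by its format, the paper is aimed at NeurIPS. The original ViT and related architectures split every image into a fixed number of patches. In general, the more tokens, the higher the accuracy, but also the higher the computational cost. How to balance this trade-off is what the paper discusses. Since different images and different objects also need different levels of detail, choosing the patch (token) size well matters a lot.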

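The paper therefore proposes the Dynamic Vision Transformer (DVT) for Transformer architectures. It is trained with an increasing number of tokens so that the token count can be adjusted dynamically. At test time, as illustrated in the figure, the model tries successively finer patch sizes until the prediction reaches a specified confidence. In addition, feature-wise and relationship-wise reuse mechanisms are proposed to reduce computation.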

Method

Several Transformers with different patch sizes are trained, and a confidence score is used for early termination. Feature Reuse and Relationship Reuse blocks are used to reduce computation.

D_train: Training set
(x, y): Sample in D_train
L_CE: cross-entropy
p_i: softmax prediction probability output by the i-th exit.
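
To make the notation above concrete, here is a minimal sketch of a training objective that sums the cross-entropy L_CE over all exits and averages over samples from D_train. The function names are hypothetical, and the exact weighting of the exits is an assumption made for illustration, not something stated in this post.

```python
import math

def cross_entropy(probs, label):
    """L_CE for one sample: negative log-probability of the true class."""
    return -math.log(probs[label])

def dvt_sample_loss(exit_probs, label):
    """Sum L_CE over all exits for one (x, y) sample.

    exit_probs[i] is p_i, the softmax output of the i-th exit.
    Equal weighting of the exits is an assumption of this sketch.
    """
    return sum(cross_entropy(p, label) for p in exit_probs)

def dvt_training_loss(batch):
    """Average the per-sample loss over (exit_probs, label) pairs from D_train."""
    return sum(dvt_sample_loss(p, y) for p, y in batch) / len(batch)
```

With two exits that assign probabilities 0.5 and 0.75 to the true class, the per-sample loss is `-log(0.5) - log(0.75)`.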

A quick recap of ViT: MSA is multi-head self-attention, LN is layer normalization, l is the layer index, and MLP is a multilayer perceptron.
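
For reference, the standard pre-norm ViT encoder block that these symbols describe can be written as below. This is the usual formulation from the ViT literature, written with z_l for the token sequence after layer l and L for the number of layers (both symbols are assumptions here). It is not a reproduction of the numbered equations referenced later in this post.

```latex
\begin{aligned}
z'_l &= \mathrm{MSA}\big(\mathrm{LN}(z_{l-1})\big) + z_{l-1}, \\
z_l  &= \mathrm{MLP}\big(\mathrm{LN}(z'_l)\big) + z'_l, \qquad l = 1, \dots, L.
\end{aligned}
```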

Feature reuse

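Feature reuse fuses the output features of the previous (upstream) Transformer into the next (downstream) Transformer by concatenation. Before concatenation, the output features pass through the LN-MLP shown on the right of the figure and are then upsampled. They are reshaped first to preserve the spatial layout of the image, and flattened afterwards because ViT operates on a sequence of patches.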

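E_l is the layer-wise embedding for the downstream model, which the authors call the context embedding. F_l changes the dimension from D to D', and the result is concatenated with the original input feature map, so Eq. (3) can effectively be replaced by Eq. (5). This passes the upstream result down so the downstream model does not have to learn from scratch, and it also performs better.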
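
The reshape, upsample, flatten, and concatenate steps described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: tokens are lists of floats, the class token is ignored, `project` is a hypothetical stand-in for the LN-MLP that maps D to D', and nearest-neighbour upsampling is chosen for simplicity rather than taken from the paper.

```python
def upsample_nearest(grid, out_h, out_w):
    """Nearest-neighbour upsampling of an H x W grid of feature vectors."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def context_embedding(up_tokens, up_side, down_side, project):
    """Build a context embedding from upstream tokens.

    up_tokens: list of up_side * up_side feature vectors (no class token).
    project:   hypothetical stand-in for the LN-MLP mapping D to D'.
    """
    projected = [project(t) for t in up_tokens]
    # Reshape the token sequence back into its 2D spatial layout.
    grid = [projected[r * up_side:(r + 1) * up_side] for r in range(up_side)]
    # Upsample to the downstream token grid.
    grid = upsample_nearest(grid, down_side, down_side)
    # Flatten back into a token sequence.
    return [vec for row in grid for vec in row]

def concat_tokens(down_tokens, context):
    """Concatenate each downstream token with its context embedding."""
    return [list(z) + list(e) for z, e in zip(down_tokens, context)]
```

For example, a 2x2 upstream grid upsampled to 4x4 yields 16 context vectors, one per downstream token, and each concatenated token then has dimension D + D'.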

Relationship reuse

The idea is largely the same as feature reuse, except that it is applied to the attention maps.

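In the formulas for this part, Q, K, and V are the attention query, key, and value; d is the hidden dimension of Q and K; N_up is the number of tokens in the upstream model; and N_Att_up is the number of attention maps in the upstream model, that is, number of heads (N_H) × number of layers (L). With these, Eq. (7) can be replaced by Eq. (9).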

The r() in Eq. (9) follows the same reshape, upsample, flatten idea, and its output dimension matches the number of heads in the multi-head attention.
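
As a rough sketch of the idea, the upstream attention logits can be resized to the downstream token count and added to the downstream logits before the softmax. The version below handles a single attention map, replaces the learned part of r() with plain nearest-neighbour resizing, and uses hypothetical function names. It shows the data flow only and should not be read as the paper's exact operator.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def nearest_index(i, side, up_side):
    """Map downstream token i (side x side grid) to its nearest upstream token."""
    r, c = divmod(i, side)
    return (r * up_side // side) * up_side + (c * up_side // side)

def resize_attention(a_up, up_side, side):
    """Simplified stand-in for r(): resize an N_up x N_up logit map to N x N."""
    n = side * side
    idx = [nearest_index(i, side, up_side) for i in range(n)]
    return [[a_up[idx[i]][idx[j]] for j in range(n)] for i in range(n)]

def attention_with_reuse(a, a_up, up_side, side):
    """Row-wise softmax of (downstream logits + resized upstream logits).

    The result would then be multiplied by V, which is omitted here.
    """
    bias = resize_attention(a_up, up_side, side)
    return [softmax([x + b for x, b in zip(row, bias_row)])
            for row, bias_row in zip(a, bias)]
```

Because the upstream logits enter as an additive term before the softmax, the downstream model can still adjust the attention pattern through its own logits.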

Adaptive Inference

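As in Figure 2 above, a softmax prediction is made on the output of each Transformer. When the largest p exceeds a threshold, inference stops early; otherwise the image is split into more tokens and passed to the next Transformer. The last Transformer has no threshold. The token embedding dimension is the same for every Transformer.

The thresholds are chosen on the validation set D_val under a computational budget B, where B is greater than 0.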
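
The early-exit procedure can be sketched as a simple loop. Here `exits` is a hypothetical list of callables that return softmax probabilities, ordered from the coarsest tokenization to the finest. Feature and relationship reuse between exits is left out, and the thresholds are assumed to be given rather than searched for.

```python
def adaptive_inference(image, exits, thresholds):
    """Run the cascade until one exit is confident enough.

    exits:      exits[i](image) returns the softmax probabilities p_i of the
                i-th Transformer (hypothetical callables for this sketch).
    thresholds: one value per exit except the last, which always answers.
    Returns (predicted_class, index_of_exit_used).
    """
    if len(thresholds) != len(exits) - 1:
        raise ValueError("need one threshold per exit except the last")
    for i, model in enumerate(exits):
        probs = model(image)
        confidence = max(probs)
        is_last = i == len(exits) - 1
        # Stop as soon as the confidence exceeds this exit's threshold.
        if is_last or confidence > thresholds[i]:
            return probs.index(confidence), i
```

Lower thresholds send more images out of the early, cheaper exits, which is how a smaller computational budget B is traded against accuracy.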

Experiment

Dataset: ImageNet, CIFAR-10/100

Backbones: T2T-ViT-12 [52], T2T-ViT-14 [52], and DeiT-small (w/o distillation)

Results on ImageNet: Figure 6 uses T2T-ViT [52] as the backbone and Figure 7 uses DeiT [35]. In both cases DVT gives a considerable speedup.

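This table is based on T2T-ViT-12 with a two-exit DVT whose exits use 7x7 and 14x14 tokens. DVT improves accuracy for backbones with the smaller patch size and improves speed for backbones with the larger patch size.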

Ablation Study

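This experiment uses a three-exit DVT based on T2T-ViT-12 with early termination disabled. Accuracy drops slightly at the first exit, but overall a small amount of extra computation still buys higher accuracy.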

Next is an ablation study of each reuse mechanism, this time using a three-exit (7x7, 10x10) DVT based on T2T-ViT-12.

There is also a comparison of different early-termination criteria.

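Finally, we can look at sample outputs from the first exit (easy) and the third exit (hard). The easy images do tend to contain simpler scenes and a single object. Figure 10 shows how many images leave at each exit as the computational budget increases.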

Written by Balin

NTUST CSIE
