Gaze Estimation using Transformer (2021)

3 min read · Sep 28, 2021

Introduction

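ViT is still fairly new, and many fields have not yet run experiments with it. This paper does so for gaze estimation, and the main question it explores is whether transformers are suitable for gaze estimation tasks.
The paper is written in the NeurIPS format. It proposes two ways of doing gaze estimation: pure transformers and hybrid transformers. On every gaze dataset, the hybrid transformer outperforms both the pure transformer and the CNN while using fewer parameters.
The ViT architecture is the pure ViT proposed by Google, and the CNN backbone is ResNet-18.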

Pure Transformers in Gaze Estimation

Hybrid Transformers in Gaze Estimation

Gaze estimation is a regression task, and it is hard to predict human gaze from a local patch such as half of an eye image.

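The approach here is similar to the one above, except that the projection is not done with a linear model. Instead, ResNet-18 turns the input face into a feature map of shape (h, w, c), which is then reshaped to (l, c). Here l = h * w is treated directly as the number of patches, and the reshaped features replace z_img from above. As before, the token is concatenated, the position encoding is added, and the result is fed into the transformer and the MLP.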
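The reshape and token steps described above can be sketched in plain Python with nested lists. This is only an illustration of the data flow, not the paper's implementation: the function names, the toy sizes, and the constant token and position-encoding values are all assumptions, and the token is prepended as in ViT.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # nested lists with shape (h, w, c)


def flatten_feature_map(feature_map: FeatureMap) -> List[Vector]:
    """Reshape (h, w, c) into (l, c) in row-major order, where l = h * w."""
    return [list(pixel) for row in feature_map for pixel in row]


def build_transformer_input(
    feature_map: FeatureMap,
    token: Vector,
    position_encoding: List[Vector],
) -> List[Vector]:
    """Prepend the extra token, then add the position encoding row by row."""
    sequence = [list(token)] + flatten_feature_map(feature_map)
    if len(position_encoding) != len(sequence):
        raise ValueError("position_encoding needs l + 1 rows")
    return [
        [x + p for x, p in zip(vector, position)]
        for vector, position in zip(sequence, position_encoding)
    ]


# Toy sizes, assumed for illustration only: h = 2, w = 3, c = 4.
h, w, c = 2, 3, 4
# Stand-in for the CNN feature map; every pixel holds its own flat index.
feature_map = [[[float(i * w + j)] * c for j in range(w)] for i in range(h)]
token = [0.0] * c  # hypothetical token values (learned in a real model)
position_encoding = [[0.5] * c for _ in range(h * w + 1)]  # hypothetical values

tokens = build_transformer_input(feature_map, token, position_encoding)
print(len(tokens), len(tokens[0]))  # 7 4
```

The resulting sequence has l + 1 rows of width c, which is the shape a transformer encoder would consume.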

Common setting:

Pure Transformers:

Hybrid Transformers:

The ResNet output is then passed through a (1, 1) convolution to reduce its channel dimension.
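A (1, 1) convolution mixes channels independently at each spatial position, so it amounts to applying the same linear layer to every one of the h * w feature vectors. A minimal pure-Python sketch follows; the function name `conv1x1` and the toy weights are hypothetical and not taken from the paper.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # nested lists with shape (h, w, c)


def conv1x1(feature_map: FeatureMap, weight: List[Vector], bias: Vector) -> FeatureMap:
    """Apply a (1, 1) convolution to an (h, w, c_in) feature map.

    weight has shape (c_out, c_in) and bias has shape (c_out,).
    The spatial size (h, w) is unchanged; only the channel count changes.
    """
    return [
        [
            [
                sum(w_k * x_k for w_k, x_k in zip(w_row, pixel)) + b
                for w_row, b in zip(weight, bias)
            ]
            for pixel in row
        ]
        for row in feature_map
    ]


# Toy example (assumed values): reduce 3 channels to 2 on a 1 x 2 feature map.
features = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 1.0]

reduced = conv1x1(features, weight, bias)
print(reduced)  # [[[1.0, 6.0], [4.0, 12.0]]]
```

Because the spatial size is preserved, the number of patches l = h * w stays the same and only the per-token width fed to the transformer shrinks.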

The table below compares the two; the Hybrid transformer performs better.

Following AFF-Net, eye and face corner position information is added to the last MLP, and the models are not pre-trained on ETH-XGaze. The two are then compared on point-of-gaze (POG).

Hyper-parameters in Hybrid Transformer

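This section runs an ablation study on the number of transformer layers, the number of heads, and the input dimension. The input dimension here is the output dimension of the (1, 1) convolution that follows ResNet-18, which is the dimension of the input given to the transformer.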