Learning-based Region Selection for End-to-End Gaze Estimation (BMVC2020)

6 min read · Nov 16, 2021

Introduction

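The authors argue that appearance-based gaze estimation is easily affected by the environment when it takes regions at fixed locations as input, and that when the head is strongly tilted or partially occluded, fixed locations do not necessarily capture the most informative regions. They therefore propose selecting facial regions dynamically based on the image content. The contributions are as follows.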

An end-to-end framework consisting of a Region Selection Network (RSN), which dynamically selects informative regions, and a gaze estimation network (Gaze Net), which predicts the gaze.
A three-stage training procedure and a new loss that train the RSN module without region labels.
SOTA results on GazeCapture and in cross-dataset evaluations, particularly for challenging cases such as difficult lighting conditions, extreme head angles, and self-occlusion.

Method

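The face image (input image) is first passed through the RSN, which samples M regions from a location pool. The region grids, which encode each region's original position, and the cropped face regions are then fed to Gaze Net, together with the input image in some experiments only, so that the framework can learn which regions are better suited to gaze estimation.
Gaze Net passes the region grids, the regions, and the input image (optional) through backbones, concatenates the resulting features, and feeds them into three FC layers to predict the 2D gaze direction (g).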
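
To make the inputs described above more concrete, here is a minimal sketch of building a location pool, sampling M regions from it, cropping them, and producing one region grid per crop. Everything specific here is an assumption made for illustration, not taken from the paper: the box sizes and stride, the function names (build_location_pool, crop, region_grid, select_random_regions), and the choice to represent a region grid as a binary mask over the image. The sampling is uniform, which corresponds to the random selection used in stage one rather than to a trained RSN.

```python
import random


def build_location_pool(img_h, img_w, sizes=(32, 64), stride=32):
    """Enumerate candidate square boxes (top, left, size) that fit in the image.

    The sizes and stride are illustrative assumptions.
    """
    pool = []
    for size in sizes:
        for top in range(0, img_h - size + 1, stride):
            for left in range(0, img_w - size + 1, stride):
                pool.append((top, left, size))
    return pool


def crop(image, box):
    """Cut the box out of an image stored as a list of rows."""
    top, left, size = box
    return [row[left:left + size] for row in image[top:top + size]]


def region_grid(img_h, img_w, box):
    """Binary mask marking where the crop came from (assumed representation)."""
    top, left, size = box
    return [
        [1 if top <= y < top + size and left <= x < left + size else 0
         for x in range(img_w)]
        for y in range(img_h)
    ]


def select_random_regions(image, pool, m, rng):
    """Uniformly sample m distinct boxes and return (box, crop, grid) triples."""
    img_h, img_w = len(image), len(image[0])
    boxes = rng.sample(pool, m)
    return [(box, crop(image, box), region_grid(img_h, img_w, box))
            for box in boxes]
```

Calling select_random_regions(image, build_location_pool(128, 128), 3, random.Random(0)) on a 128x128 image returns three triples. In the framework described above, the crops and grids would each go through a backbone before their features are concatenated.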

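Training the RSN effectively is not easy, however. Even human annotators cannot tell precisely where the informative regions are, and Gaze Net's accuracy cannot be used directly as the supervision signal either: the RSN would initially select regions with relatively high gaze estimation error, so training easily gets stuck in local minima. The authors therefore train in the following three stages.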

Training procedure

Stage one:

Gaze Net is trained in a supervised manner; at this stage the regions are randomly selected.

Stage two:

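Regions selected by the RSN (subscript _s) and regions selected at random (subscript _r) are fed into Gaze Net separately, producing g_s and g_r, and the estimation errors e_s and e_r of the two are computed. Because g_s and g_r are originally yaw and pitch in a 2D spherical coordinate system, they are first converted to 3D vectors, and the cosine similarity is then computed.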
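
The conversion from yaw and pitch to a 3D vector, followed by the cosine similarity, can be sketched as below. This is only an illustration under stated assumptions: the axis convention (which axis yaw and pitch rotate around, and the signs) varies between datasets and is not given in this post, and turning the cosine similarity into an angle in degrees is one common way to express the error, not necessarily the exact form of e_s and e_r in the paper. The function names are hypothetical.

```python
import math


def yaw_pitch_to_vector(yaw, pitch):
    """Convert (yaw, pitch) in radians to a 3D unit vector.

    Assumed convention: x = cos(pitch) * sin(yaw), y = sin(pitch),
    z = cos(pitch) * cos(yaw). Other conventions differ in axes and signs.
    """
    return (
        math.cos(pitch) * math.sin(yaw),
        math.sin(pitch),
        math.cos(pitch) * math.cos(yaw),
    )


def cosine_similarity(u, v):
    """Cosine of the angle between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angular_error_deg(pred, target):
    """Angle in degrees between predicted and ground-truth gaze (yaw, pitch)."""
    cos_sim = cosine_similarity(yaw_pitch_to_vector(*pred),
                                yaw_pitch_to_vector(*target))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_sim = max(-1.0, min(1.0, cos_sim))
    return math.degrees(math.acos(cos_sim))
```

With this helper, e_s would be angular_error_deg(g_s, g) and e_r would be angular_error_deg(g_r, g) for a ground-truth gaze g, so the two selection strategies can be compared on the same sample.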