Detecting Attended Visual Targets in Video (CVPR2020)

3 min read · Jul 7, 2021

Introduction

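Deep learning approaches to purely image-based eye tracking fall into three main lines of work: Gaze Target Prediction, Gaze Behavior Recognition, and Social Gaze Analysis. This paper focuses on Gaze Target Prediction. As shown in the figure above, the dataset distinguishes out-of-frame cases (the person is looking somewhere outside the image) from in-frame cases (the person is looking at something inside the image). The paper's contributions are: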

1. A novel spatio-temporal deep learning architecture that learns to predict dynamic gaze targets in video.

2. A new VideoAttentionTarget dataset, containing dense annotations of attention targets with complex patterns of gaze behavior.

3. Demonstration that our model’s predicted attention map can achieve state-of-the-art results on two social gaze behavior recognition tasks.

Method

The model has three main parts: the head conditioning branch, the main scene branch, and the recurrent attention prediction module.

head conditioning branch

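Head Conv is a ResNet-50 followed by one extra residual layer and an average pooling layer. Its output is then combined with the Head Position and fed to a fully-connected layer (the Attention Layer).
Instead of passing the position in directly, the Head Position is encoded as a black bounding box drawn at the head's relative location in the image. The benefit is that the model can pick up relative depth and learn more efficiently.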
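
To make the Head Position encoding concrete, here is a minimal sketch of such a mask in plain Python. The function name, the normalized (x_min, y_min, x_max, y_max) box format, and the 8 x 8 grid are my own assumptions for illustration, not the paper's actual preprocessing.

```python
def head_position_mask(bbox, width, height):
    """Return a height x width grid: 0.0 (black) inside the head bbox, 1.0 elsewhere.

    bbox is (x_min, y_min, x_max, y_max) in normalized [0, 1] image coordinates
    (an assumed format, chosen for this sketch).
    """
    x_min, y_min, x_max, y_max = bbox
    # Convert normalized coordinates to grid indices, clamped to the grid.
    c0 = max(0, min(width, int(round(x_min * width))))
    c1 = max(0, min(width, int(round(x_max * width))))
    r0 = max(0, min(height, int(round(y_min * height))))
    r1 = max(0, min(height, int(round(y_max * height))))
    mask = [[1.0] * width for _ in range(height)]
    for r in range(r0, r1):
        for c in range(c0, c1):
            mask[r][c] = 0.0  # black pixels mark the head region
    return mask


mask = head_position_mask((0.25, 0.0, 0.75, 0.5), width=8, height=8)
```

A bigger box means a head that is closer to the camera, which is one way to read the claim that this encoding lets the model learn relative depth.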

main scene branch

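Scene Conv has the same architecture as Head Conv, but here the Head Position is concatenated directly onto the input image. The authors say it acts as a spatial reference that helps the model learn faster. The Scene Conv output is multiplied by the attention map, then concatenated with the Head Feature Map, and finally fed to the encoder (two convolution layers).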
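
The multiply-then-concatenate step can be sketched with nested lists standing in for tensors. This is only a toy illustration: the softmax normalization of the attention values, the function names, and the tiny 2 x 2 shapes are assumptions of mine, not details taken from the paper.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def apply_attention(scene_features, attention_logits):
    """Weight every channel of scene_features (C x H x W) by one H x W attention map."""
    height = len(scene_features[0])
    width = len(scene_features[0][0])
    weights = softmax(attention_logits)  # flat list of H*W weights (assumed softmax)
    attention = [weights[r * width:(r + 1) * width] for r in range(height)]
    return [
        [[channel[r][c] * attention[r][c] for c in range(width)] for r in range(height)]
        for channel in scene_features
    ]


def fuse(weighted_scene, head_features):
    """Concatenate along the channel axis, giving (C1 + C2) x H x W."""
    return weighted_scene + head_features


scene = [[[1.0, 2.0], [3.0, 4.0]]]   # 1 channel, 2 x 2
head = [[[0.5, 0.5], [0.5, 0.5]]]    # 1 channel, 2 x 2
logits = [0.0, 0.0, 0.0, 0.0]        # uniform attention
fused = fuse(apply_attention(scene, logits), head)
```

With uniform logits every location gets weight 0.25, which corresponds to a scene feature map that is uniformly weighted.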

recurrent attention prediction module

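A ConvLSTM captures the temporal information, and its output goes to a decoder made of 4 deconvolution layers that outputs a full-sized feature map. This lets the model use the gaze position predicted at each frame to correct itself toward a more accurate location.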

Heatmap Modulation

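The full-sized feature map is modulated by a scalar α, where a high value means the attention target is in-frame; the result is the final heat map. α is learned by the "in-frame" block, which consists of two convolutional layers plus one fully-connected layer. The modulation is an element-wise subtraction of (1-α) from the normalized full-sized feature map, which feels similar to matting, where foreground is separated from background.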
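
A small sketch of the modulation step, under stated assumptions: I use min-max normalization for the "normalized" feature map and clip negative values to 0 so the heat map stays non-negative. Both choices, and the function name, are assumptions made for this illustration.

```python
def modulate_heatmap(feature_map, alpha):
    """Subtract (1 - alpha) from the normalized map, then clip at 0."""
    flat = [v for row in feature_map for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    out = []
    for row in feature_map:
        # Min-max normalization to [0, 1] is an assumption about "normalized".
        norm = [(v - lo) / span if span > 0 else 0.0 for v in row]
        # Clipping at 0 is also an assumption of this sketch.
        out.append([max(0.0, v - (1.0 - alpha)) for v in norm])
    return out


features = [[0.0, 2.0], [4.0, 8.0]]
in_frame = modulate_heatmap(features, alpha=1.0)
out_of_frame = modulate_heatmap(features, alpha=0.0)
```

When α is 1 the normalized map passes through unchanged, and when α is 0 the whole heat map is suppressed, which matches the idea that a low α means the target is outside the frame.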

Loss

Heatmap loss L_h is computed using MSE loss.
In-frame loss L_f is computed with binary cross entropy loss.
For L_h, the ground truth is a Gaussian centered on the target point.
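
The two loss terms can be sketched in a few lines. The 64 x 64 heat map size, sigma of 3, the unnormalized Gaussian with peak value 1, and all function names are assumptions for illustration; how the two terms are weighted against each other is not covered here.

```python
import math


def gaussian_heatmap(size, center, sigma):
    """size x size grid with a Gaussian peak (value 1.0) at center = (col, row)."""
    cx, cy = center
    return [
        [math.exp(-((c - cx) ** 2 + (r - cy) ** 2) / (2.0 * sigma ** 2))
         for c in range(size)]
        for r in range(size)
    ]


def mse_loss(pred, target):
    """Mean squared error over all heat map cells (the form used for L_h)."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def bce_loss(prob, label, eps=1e-7):
    """Binary cross entropy for one in-frame probability (the form used for L_f)."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


target = gaussian_heatmap(size=64, center=(20, 30), sigma=3.0)
l_h = mse_loss(target, target)  # a perfect prediction gives zero heat map loss
l_f = bce_loss(0.9, 1)          # confident and correct in-frame prediction
```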

Overall, the model is first trained on GazeFollow. The encoder is then frozen to prevent overfitting, and training continues on the VideoAttentionTarget dataset.

Experiment

GazeFollow

The model is trained on GazeFollow using single frames only.
The Human row is the result from human annotators and can be treated as an upper bound.

VideoAttentionTarget

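Area Under Curve (AUC): each cell of the predicted heat map (the output supervised by L_h above) is judged true or false against the ground truth, and the area under the Receiver Operating Characteristic (ROC) curve is computed. See this article for details.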
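
As a rough illustration of the AUC idea, the sketch below scores heat map cells against a binary ground-truth map using the pairwise-ranking form of AUC. The function name and the toy 2 x 2 maps are made up for this example, and the paper's exact evaluation protocol may differ.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive cell outscores a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative cell")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


heatmap = [[0.9, 0.2], [0.4, 0.1]]
ground_truth = [[1, 0], [0, 0]]
scores = [v for row in heatmap for v in row]
labels = [v for row in ground_truth for v in row]
auc = roc_auc(scores, labels)
```

The double loop is quadratic, which is fine for a toy map; a sort-based version would be the better choice for real heat maps.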

L2 Dist: the Euclidean distance between the predicted and ground-truth gaze points, computed after normalizing the image size to 1.
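
A minimal sketch of this metric, assuming the predicted point is taken as the hottest cell of the heat map and that both points live in a unit square. The helper names are hypothetical.

```python
import math


def heatmap_argmax(heatmap):
    """Return (x, y) of the hottest cell, normalized to [0, 1] by the map size."""
    height, width = len(heatmap), len(heatmap[0])
    best_r, best_c = max(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    return best_c / width, best_r / height


def l2_distance(pred_xy, true_xy):
    """Euclidean distance between two points in normalized image coordinates."""
    return math.hypot(pred_xy[0] - true_xy[0], pred_xy[1] - true_xy[1])


heatmap = [[0.0, 0.1, 0.0, 0.0],
           [0.0, 0.2, 0.9, 0.0],
           [0.0, 0.0, 0.3, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
pred = heatmap_argmax(heatmap)
dist = l2_distance(pred, (0.5, 0.5))
```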

Out-of-frame AP: computed from the α described above and the ground truth.
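
For reference, average precision over a set of frames can be sketched as below. Treating out-of-frame as the positive class and using 1 - α as its score is my assumption for this example; the paper may set it up differently, and the numbers are invented.

```python
def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("need at least one positive sample")
    return sum(precisions) / len(precisions)


alphas = [0.1, 0.8, 0.3, 0.9]   # toy in-frame scores, one per frame
out_of_frame = [1, 0, 0, 1]     # toy ground truth: 1 = target is out of frame
ap = average_precision([1.0 - a for a in alphas], out_of_frame)
```

In this toy case the last frame is truly out of frame but gets a high α, so it is ranked last and drags the AP down.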

An ablation study is also included, with the following variants:

No head position is when the head position image is not used.
No head features is when the head feature map from the Head Conv module is not provided.
No attention map is when the attention map is not produced, so the scene feature map is uniformly weighted.
No fusion is when the head feature map is only used to produce attention map and not concatenated with scene feature map for encoding.
No temporal is when ConvLSTM is not used.

Toddler

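The model is also tested on a toddler dataset. People with autism find it harder to control their gaze during social interaction, and the heat map this model provides can help other models with clinical assessment. In short, gaze ability is evaluated through things like object category and gaze movement time.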

VideoCoAtt

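Because the model itself does not produce head position information, this experiment uses a fine-tuned SSD-based head detector [29] to get the input head positions. However, because the labels are different, the gaze target detection part is not fine-tuned here.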