Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection (CVPR2021)

Oct 22, 2021

Introduction

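Human-Object Interaction (HOI) detection has to predict the interactions as well as detect the humans. This paper (panel (d) in the figure above) proposes the Glance and Gaze Network (GGNet), which computes Action-aware points (ActPoints) from the interaction area to address the over-sensitivity to a single point seen in approach (c) of the figure above. Action-aware point matching (APM) then assigns each interaction to a human-object pair, and a focal loss (called Hard Negative Attentive loss, HNA, in the paper) handles the imbalance between positive and negative samples.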

Method

The method has three parts: interaction prediction, human-object pair matching, and object detection. All three share the same backbone, and human-object pair matching combines the information from interaction prediction and object detection.

A network with one glance step and two gaze steps predicts the ActPoints; the experiments use 25 points (n = 25).

Glance step

The glance step outputs V-dimensional interaction categories and is trained with supervision using focal loss.
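To make the element-wise focal loss concrete, here is a minimal sketch in plain Python. It assumes the standard binary focal loss form with the commonly used defaults `alpha = 0.25` and `gamma = 2`; the exact variant and hyperparameters used in GGNet are not stated in this post, and the function name `focal_loss` is only illustrative.

```python
import math

def focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean element-wise binary focal loss.

    probs:   predicted probabilities, one per interaction category
    targets: 0/1 labels of the same length
    alpha, gamma: assumed defaults, not values taken from the paper
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if t == 1:
            # Positives: well-classified samples (p near 1) are down-weighted.
            total += -alpha * (1.0 - p) ** gamma * math.log(p)
        else:
            # Negatives: easy negatives (p near 0) are down-weighted.
            total += -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
    return total / len(probs)

# One location with V = 3 interaction categories; only category 0 is active.
confident = focal_loss([0.9, 0.1, 0.1], [1, 0, 0])
uncertain = focal_loss([0.6, 0.4, 0.4], [1, 0, 0])
print(confident < uncertain)
```

The confident prediction contributes far less loss than the uncertain one, which is the behaviour that lets training concentrate on the harder samples.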

Gaze steps

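Because F⁰ is trained as an action-aware feature, it can predict the approximate locations of the ActPoints. It is therefore passed through a 5x5 conv, and the result serves as the offset field of the deformable conv for F¹ (step 1); this offset field is taken as the ActPoints. To keep the ActPoints reasonable, G¹ is trained with the same V-dimensional element-wise focal loss as the glance step.
In practice, however, the human and the interacting object can be far apart, so the usual receptive-field limitation of convolution shows up again. Step 2 therefore refines the ActPoint locations with the same procedure as step 1, except that this time the offset field additionally includes a residual from G¹.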
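The relation between an offset field and the sampled points can be sketched as follows. This assumes the usual deformable-convolution convention, where each learned offset is added to a regular kernel grid position around the centre, and it models the step 2 refinement as step 1 offsets plus a residual. The helper names (`regular_grid`, `act_points`, `refine_offsets`) are hypothetical and not taken from the paper or its code.

```python
def regular_grid(kernel_size=5):
    """Regular (dy, dx) kernel positions; a 5x5 kernel gives n = 25 points."""
    r = kernel_size // 2
    return [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]

def act_points(center, offsets, kernel_size=5):
    """Sampling locations for one feature-map position.

    center:  (y, x) of the feature-map position
    offsets: one learned (dy, dx) per kernel tap, read from the offset field
    """
    cy, cx = center
    grid = regular_grid(kernel_size)
    if len(offsets) != len(grid):
        raise ValueError("expected one offset per kernel tap")
    return [(cy + gy + oy, cx + gx + ox)
            for (gy, gx), (oy, ox) in zip(grid, offsets)]

def refine_offsets(step1_offsets, residuals):
    """Assumed step 2 refinement: step 1 offsets plus a predicted residual."""
    return [(oy + ry, ox + rx)
            for (oy, ox), (ry, rx) in zip(step1_offsets, residuals)]

# With zero offsets the points fall back to the plain 5x5 grid.
zero_offsets = [(0.0, 0.0)] * 25
points = act_points((10, 10), zero_offsets)
# A residual that pushes every point 3 pixels to the right.
shifted = act_points((10, 10), refine_offsets(zero_offsets, [(0.0, 3.0)] * 25))
```

With non-zero offsets the 25 points are free to move away from the grid, which is how they can reach a human and an object that lie far apart.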

Action-aware Point Matching

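Different actions usually have different spatial characteristics, so the authors propose Action-aware point matching (APM), which trains a separate location regressor for each interaction category. Each regressor outputs a 2-dimensional offset from the interaction point to the human point and another to the object point, so the output has shape H/d × W/d × 4V.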
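A rough sketch of how the 4V offsets at one location could be decoded and matched to detections is shown below. The channel layout (human dx, human dy, object dx, object dy for each action) and the nearest-centre matching rule are assumptions made for illustration; the paper's actual layout and matching criterion may differ, and `decode_pair` and `nearest` are hypothetical helpers.

```python
import math

def decode_pair(interaction_point, offsets_4v, action_index):
    """Predicted human and object points for one action at one location.

    offsets_4v: flat list of length 4 * V; the four values for action v
                are assumed to be [human dx, human dy, object dx, object dy].
    """
    x, y = interaction_point
    start = 4 * action_index
    h_dx, h_dy, o_dx, o_dy = offsets_4v[start:start + 4]
    return (x + h_dx, y + h_dy), (x + o_dx, y + o_dy)

def nearest(point, candidates):
    """Index of the candidate centre closest to `point` (assumed matching rule)."""
    return min(
        range(len(candidates)),
        key=lambda i: math.hypot(point[0] - candidates[i][0],
                                 point[1] - candidates[i][1]),
    )

# V = 2 actions; action 1 is detected at interaction point (10, 10).
offsets = [0.0, 0.0, 0.0, 0.0, -3.0, 1.0, 4.0, -2.0]
human_pt, object_pt = decode_pair((10.0, 10.0), offsets, action_index=1)

human_centers = [(20.0, 20.0), (7.5, 11.0)]   # centres from the object detector
object_centers = [(14.0, 8.5), (0.0, 0.0)]
pair = (nearest(human_pt, human_centers), nearest(object_pt, object_centers))
```

Keeping one regressor per action lets, for example, "ride" and "throw" learn very different typical offsets from the same interaction point.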

Loss

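Some interaction categories have very few positive samples, which easily leads to imbalance, so the authors propose the Hard negative attentive (HNA) loss to make the model focus more on hard negative samples.
The paper defines hard negative samples in terms of meaningful, closely related interactions that occur with the same object in the training data. For example, if <human carry bicycle> is a positive sample and <human repair bicycle> is not, then <human repair bicycle> is a hard negative sample for the "repair" category.
The Gaussian mask below is used for the loss. The ground truth already contains the locations of the interaction pixels. Normally only the ground-truth labels in [0, 1] enter the loss computation; the authors additionally add hard negative samples as negative labels in [-1, 0], and set everything else to 0.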
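The labelling scheme can be sketched as a per-category label map: Gaussian values in [0, 1] around positive interaction points, mirrored values in [-1, 0] around hard negatives, and 0 elsewhere. The Gaussian shape for the hard negatives, the `sigma` and `radius` parameters, and the tie-breaking rule are all assumptions for illustration; the sketch only builds the labels and does not reproduce the paper's loss formula.

```python
import math

def gaussian_value(y, x, center, sigma, radius):
    """Gaussian bump around `center`, truncated to 0 outside `radius`."""
    cy, cx = center
    d2 = (y - cy) ** 2 + (x - cx) ** 2
    if d2 > radius ** 2:
        return 0.0
    return math.exp(-d2 / (2.0 * sigma ** 2))

def hna_label_map(height, width, positives, hard_negatives,
                  sigma=1.0, radius=2):
    """Label map for ONE interaction category.

    positives:      (y, x) interaction points labelled with this category
    hard_negatives: (y, x) interaction points of other actions on an object
                    this category could plausibly apply to
    Returns a height x width grid with values in [-1, 1].
    """
    labels = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            pos = max((gaussian_value(y, x, c, sigma, radius)
                       for c in positives), default=0.0)
            neg = max((gaussian_value(y, x, c, sigma, radius)
                       for c in hard_negatives), default=0.0)
            # Assumed tie-break: a positive label wins over a hard negative.
            labels[y][x] = pos if pos >= neg else -neg
    return labels

# "repair" map: one true "repair" point, plus a "carry bicycle" point
# acting as a hard negative for "repair".
labels = hna_label_map(7, 7, positives=[(1, 1)], hard_negatives=[(5, 5)])
```

Pixels near the "carry bicycle" interaction point receive negative labels on the "repair" map, so a loss built on this map can treat them differently from ordinary background.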