Background Matting: The World is Your Green Screen (CVPR2020)

Balin

4 min read · May 5, 2021

Introduction

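Image matting estimates an alpha value, the per-pixel opacity that separates foreground from background, in order to cut the subject out of an image.
Many methods compute alpha with the help of a trimap. This paper instead first captures a 2D image of the background alone as one input, then captures an image with the person in the same place against the same background as a second input, and estimates alpha from the pair.
A trimap uses black / gray / white to mark background / unknown region / foreground.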
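For reference, the standard compositing equation that matting tries to invert can be written per pixel as below. It is not written out in the original post; the symbols I (observed image), F (foreground color), B (background color) and α (opacity) follow common matting notation. In trimap terms, black pixels fix α = 0, white pixels fix α = 1, and α is only unknown in the gray region.

```latex
I = \alpha F + (1 - \alpha) B, \qquad \alpha \in [0, 1]
```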

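The paper uses two networks: the authors' own residual-block-based encoder-decoder and a second matting network. To narrow the domain gap without labelled data, the second matting network is trained after the first one, with a discriminator judging the quality of the resulting images.
Input: one background photo plus a video or image containing the person (the camera's focus and exposure must be fixed while shooting).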
Our contributions include:
1. The first trimap-free automatic matting algorithm that utilizes a casually captured background.
2. A novel matting architecture (Context Switching Block) to select among input cues.
3. A self-supervised adversarial training to improve mattes on real images.
4. Experimental comparisons to a variety of competing methods on wide range of inputs (handheld, fixed-camera, indoor, outdoor), demonstrating the relative success of our approach.

Method

1. Supervised Training on the Adobe Dataset

Training is done on the non-transparent objects of the Adobe Matting Dataset (280 selected out of 450 ground-truth mattes).

Augmentation, with backgrounds taken from the MS COCO dataset:

B is augmented to obtain B'; I and M are generated from B.

I : varying resolutions, re-scalings and horizontal flips.
B' : gamma correction and Gaussian noise around the foreground region, applied to B.
S : erode (10-20 steps), dilate (15-30 steps) and blur (σ∈[3,5,7])
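The B' and S augmentations listed above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the function names, the default noise level, the 4-neighbour structuring element, and reading 0.12 as a standard deviation are all assumptions. For simplicity the noise is added to the whole background rather than only around the foreground region, and the blur step for S is left out.

```python
import numpy as np

def perturb_background(bg, rng, gamma_std=0.12, noise_sigma=3.0):
    """Hypothetical B -> B': gamma correction plus Gaussian noise.

    bg is a float array in [0, 1]; noise_sigma is given on a 0-255 scale.
    """
    gamma = rng.normal(1.0, gamma_std)           # gamma close to 1
    out = np.power(bg, gamma)
    out = out + rng.normal(0.0, noise_sigma / 255.0, size=bg.shape)
    return np.clip(out, 0.0, 1.0)

def _with_neighbours(mask):
    """Stack a 2D mask with its up/down/left/right shifts (zero padded)."""
    p = np.pad(mask, 1)
    return np.stack([p[1:-1, 1:-1], p[:-2, 1:-1], p[2:, 1:-1],
                     p[1:-1, :-2], p[1:-1, 2:]])

def erode(mask, steps):
    """Shrink the mask by one 4-neighbour ring per step."""
    for _ in range(steps):
        mask = _with_neighbours(mask).min(axis=0)
    return mask

def dilate(mask, steps):
    """Grow the mask by one 4-neighbour ring per step."""
    for _ in range(steps):
        mask = _with_neighbours(mask).max(axis=0)
    return mask

rng = np.random.default_rng(0)
background = rng.random((16, 16, 3))
b_prime = perturb_background(background, rng)

seg = np.zeros((9, 9), dtype=np.uint8)
seg[2:7, 2:7] = 1                                # 5x5 square "person"
rough_seg = dilate(erode(seg, 1), 2)             # deliberately imprecise S
```

Degrading S this way is presumably meant to keep the network from trusting the segmentation too much; in practice a library such as OpenCV or SciPy would normally be used for the morphology.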

2. Adversarial Training on Unlabelled Real data

Supervised training alone does not handle real images well. The problems are:
1. Background near fingers, arms and hair tends to be copied into the matte.
2. segmentation failing.
3. significant parts of the foreground color matching the background color.
4. misalignment between the image and the background (we assume only small misalignment)
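To address these problems, the authors train on unlabelled real data (real images plus backgrounds) with self-supervision; this is the second matting network mentioned above, G_Real.
So that G_Real and G_Adobe have similar distributions, both are initialized in the standard randomized way, and G_Adobe additionally serves as the teacher, as shown in the figure.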

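According to the authors, compared with searching for a minimum near the real data, relying on G_Adobe in this way may also lower the chance of getting stuck in a local minimum.
Here λ, the weight on the loss terms other than the adversarial one, is set to 0.05 and halved every two epochs, so that its influence shrinks and the discriminator D plays a larger role.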
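The λ schedule described above is simple enough to write down directly. The sketch below assumes epochs are counted from 0 and that the halving happens at the start of every second epoch; the function name `lambda_at` is made up for illustration.

```python
def lambda_at(epoch, initial=0.05, halve_every=2):
    """Weight on the non-adversarial loss terms at a given 0-based epoch."""
    return initial * 0.5 ** (epoch // halve_every)

# Weights for the first six epochs: two epochs per value, then halved.
schedule = [lambda_at(e) for e in range(6)]
```

With these assumptions the weight is 0.05 for epochs 0 and 1, 0.025 for epochs 2 and 3, and so on, so the adversarial term gradually dominates the generator loss.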

Datasets

Adobe Matting
MS COCO

Experimental Results

The method accepts either a single image or a video, but the main results are all compared in single-image mode, without using the temporal cues available in video.

Results on Synthetic-Composite Adobe Dataset

G_Adobe is trained on 26.9k exemplars: 269 objects × 100 random backgrounds.
Table 1 compares 220 synthetic composites: 11 humans × 20 random backgrounds.

BM: Bayesian Matting - traditional trimap-based method that can accept a known background.
CAM: Context-Aware Matting - trimap-based deep matting technique that predicts both alpha and foreground.
IM: Index Matting - trimap-based deep matting technique that predicts only alpha.
LFM: Late Fusion Matting - trimap-free deep matting algorithm that predicts only alpha.
Ours-Adobe: uses G_Adobe only
Trimap-10: dilated by 10
Trimap-20: dilated by 20
B’: translate ∈ N(0,3), rotate∈ N(0, 1.3°) and small scaling and shear followed by gamma correction γ∼ N(1, 0.12) and gaussian noise η∼ N(μ∈[−5,5], σ∈[2,4])
Rescaled all images to 512x512 to measure SAD and MSE
SAD: sum of absolute differences
MSE: mean squared error
The authors' method performs better than the trimap-based methods above.
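The two metrics in the list above are easy to state in code. The sketch below is a plain NumPy version that assumes both mattes are float arrays in [0, 1] with the same shape; papers often report scaled values, and any such scaling is not shown here.

```python
import numpy as np

def sad(pred_alpha, gt_alpha):
    """Sum of absolute differences between predicted and ground-truth alpha."""
    return float(np.abs(pred_alpha - gt_alpha).sum())

def mse(pred_alpha, gt_alpha):
    """Mean squared error between predicted and ground-truth alpha."""
    return float(np.mean((pred_alpha - gt_alpha) ** 2))

pred = np.array([[0.0, 0.5], [1.0, 1.0]])
gt = np.array([[0.0, 1.0], [1.0, 0.0]])
scores = (sad(pred, gt), mse(pred, gt))   # (1.5, 0.3125)
```

Since SAD grows with the number of pixels, rescaling every image to the same size (512x512 in the comparison above) keeps the numbers comparable across images.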

Results on Real Data

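The real data includes footage shot with a handheld camera and footage shot on a tripod. It is captured at 1920x1080 and then cropped to 512x512 (the network input size) around the segmentation mask.
G_Adobe: retrained on 280k composites consisting of 280 objects from the Adobe Dataset.
G_Real: trained separately on the handheld and the tripod datasets; for the handheld data a homography is computed. The handheld set has 18k frames and the tripod set has 19k frames, and 3,390 additional background frames (B'') are added.
Because the information needed for a quantitative comparison is missing and the methods differ too much, the comparison here is done with a user study.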
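The crop step mentioned above (1920x1080 frame, 512x512 crop around the segmentation mask) could look something like the sketch below. How the authors actually place the crop is not described in this post, so centering on the mask's bounding box and clamping to the frame are assumptions, as is the function name.

```python
import numpy as np

def crop_around_mask(image, mask, size=512):
    """Crop a size x size window centered on the mask's bounding box.

    Returns the crop and its (top, left) offset in the full frame.
    """
    h, w = mask.shape
    if h < size or w < size:
        raise ValueError("frame is smaller than the crop size")
    ys, xs = np.nonzero(mask)
    if ys.size == 0:                      # empty mask: fall back to the center
        cy, cx = h // 2, w // 2
    else:
        cy = int(ys.min() + ys.max()) // 2
        cx = int(xs.min() + xs.max()) // 2
    # Clamp so the window stays inside the frame.
    top = min(max(cy - size // 2, 0), h - size)
    left = min(max(cx - size // 2, 0), w - size)
    return image[top:top + size, left:left + size], (top, left)

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
person = np.zeros((1080, 1920), dtype=np.uint8)
person[100:300, 1700:1900] = 1            # subject near the top-right corner
crop, offset = crop_around_mask(frame, person)
```

The same offset would have to be applied to the background image and the segmentation so that all inputs stay aligned.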

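Because of camera shake, the background image may not contain the information needed for the current frame, which easily produces artifacts; this is the registration error.

Too much motion in the background causes the same problem.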

Ablation Studies

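Motion cues (not used when comparing with other methods)

The authors also ran a user study between their own two models, and G_Real is indeed better.

One puzzling point: the authors say G_Real does better on handheld data because it was trained on data containing alignment errors, yet they also say that training G_Adobe with alignment errors did not improve results. (So perhaps that is not the real reason?)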

we suspect this is because it was trained with examples having alignment errors. (We did try training ‘Ours-Adobe’ with alignment errors introduced into B′ but found the results degraded overall)

CS Block

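This ablation compares the residual-block-based encoder-decoder with and without the CS Block. Without the CS Block, the network simply concatenates (I, B’, S, M) as its input. The authors argue that this variant tends to focus on the color difference between I and B’, so holes appear in the matte when foreground and background colors are too similar.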
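To make the no-CS-Block baseline concrete, the sketch below stacks the four inputs along the channel axis. The channel counts (3 for I, 3 for B', 1 for S, 4 for M) and the function name are assumptions made for illustration and are not stated in this post.

```python
import numpy as np

def stack_inputs(image, background, segmentation, motion):
    """Baseline input: concatenate (I, B', S, M) along the channel axis."""
    return np.concatenate([image, background, segmentation, motion], axis=-1)

h, w = 512, 512
x = stack_inputs(
    np.zeros((h, w, 3), dtype=np.float32),   # I: RGB frame
    np.zeros((h, w, 3), dtype=np.float32),   # B': captured background
    np.zeros((h, w, 1), dtype=np.float32),   # S: soft segmentation
    np.zeros((h, w, 4), dtype=np.float32),   # M: motion cues (assumed 4 channels)
)
```

With a single stacked tensor like this the network has no explicit mechanism for weighing one cue against another, which is the gap the CS Block is meant to fill by selecting among input cues.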